This dataset, compiled by the FBI and obtained from the UC Irvine Machine Learning Repository, covers the crime rates of 2,215 communities across a total of 147 variables. The dataset is called Communities and Crime and combines law enforcement, socio-economic and administrative data about communities in the United States. Although the variables are very diverse, they were all recorded around 1995. There are multiple target variables, all of which are continuous rather than binary. For this analysis it is therefore necessary to create new dependent variables, obtained by splitting murdPerPop and ViolentCrimesPerPop at their medians. Two analyses will thus be conducted, one based on violence and the other on murder. This HTML document displays the analysis for the murder target variable. However, since both murdPerPop and ViolentCrimesPerPop come from the same dataset, the exploratory part is largely identical for both, except when examining the target variables' relationships with the independent variables.
rm(list = ls())
library(readxl)
crime <- read_excel("/Users/felipereyes/iCloud Drive (Archiv)/Documents/UNI/complutense/TFM/DATA.CommViolPredUnnormalizedData 2.xlsx")
library(skimr)
library(dplyr)
library(tidyverse)
library(ggthemes)
library(corrplot)
library(fastDummies)
library(corrr)
library(Hmisc)
library(parallel)
library(doParallel)
library(tidymodels)
library(ggplot2)
library(gridExtra)
library(outliers)
library(themis)
library(rpart.plot)
library(performance)
library(glue)
library(ranger)
library(vip)
library(ggrepel)
library(Rmisc)
library(MASS)
library(MXM)
library(Boruta)
library(caret)
library(pROC)
library(visualpred)
library(car)
library(DataExplorer)
library(missForest)
library(plotly)
library(lubridate)
First off, it is crucial to investigate the dataset:
glimpse(crime)
Rows: 2,215
Columns: 147
$ communityname <chr> "BerkeleyHeightstownship", "Marpletown…
$ state <chr> "NJ", "PA", "OR", "NY", "MN", "MO", "M…
$ countyCode <chr> "39", "45", "?", "35", "7", "?", "21",…
$ communityCode <chr> "5320", "47616", "?", "29443", "5068",…
$ fold <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ population <dbl> 11980, 23123, 29344, 16656, 11245, 140…
$ householdsize <chr> "3.1", "2.82", "2.43", "2.4", "2.76", …
$ racepctblack <chr> "1.37", "0.8", "0.74", "1.7", "0.53", …
$ racePctWhite <chr> "91.78", "95.57", "94.33", "97.35", "8…
$ racePctAsian <chr> "6.5", "3.44", "3.43", "0.5", "1.17", …
$ racePctHisp <chr> "1.88", "0.85", "2.35", "0.7", "0.52",…
$ agePct12t21 <chr> "12.47", "11.01", "11.36", "12.55", "2…
$ agePct12t29 <chr> "21.44", "21.3", "25.88", "25.2", "40.…
$ agePct16t24 <chr> "10.93", "10.48", "11.01", "12.19", "2…
$ agePct65up <chr> "11.33", "17.18", "10.28", "17.57", "1…
$ numbUrban <dbl> 11980, 23123, 29344, 0, 0, 140494, 287…
$ pctUrban <chr> "100", "100", "100", "0", "0", "100", …
$ medIncome <dbl> 75122, 47917, 35669, 20580, 17390, 215…
$ pctWWage <chr> "89.24", "78.99", "82", "68.15", "69.3…
$ pctWFarmSelf <chr> "1.55", "1.11", "1.15", "0.24", "0.55"…
$ pctWInvInc <chr> "70.2", "64.11", "55.73", "38.95", "42…
$ pctWSocSec <chr> "23.62", "35.5", "22.25", "39.48", "32…
$ pctWPubAsst <chr> "1.03", "2.75", "2.94", "11.71", "11.2…
$ pctWRetire <chr> "18.39", "22.85", "14.56", "18.33", "1…
$ medFamInc <dbl> 79584, 55323, 42112, 26501, 24018, 277…
$ perCapInc <dbl> 29711, 20148, 16946, 10810, 8483, 1187…
$ whitePerCap <dbl> 30233, 20191, 17103, 10909, 9009, 1202…
$ blackPerCap <dbl> 13600, 18137, 16644, 9984, 887, 7382, …
$ indianPerCap <dbl> 5725, 0, 21606, 4941, 4425, 10264, 214…
$ AsianPerCap <dbl> 27101, 20074, 15528, 3541, 3352, 10753…
$ OtherPerCap <dbl> 5115, 5250, 5954, 2451, 3000, 7192, 21…
$ HispPerCap <dbl> 22838, 12222, 8405, 4391, 1328, 8104, …
$ NumUnderPov <dbl> 227, 885, 1389, 2831, 2855, 23223, 112…
$ PctPopUnderPov <chr> "1.96", "3.98", "4.75", "17.23", "29.9…
$ PctLess9thGrade <chr> "5.81", "5.61", "2.8", "11.05", "12.15…
$ PctNotHSGrad <chr> "9.9", "13.72", "9.09", "33.68", "23.0…
$ PctBSorMore <chr> "48.18", "29.89", "30.13", "10.81", "2…
$ PctUnemployed <chr> "2.7", "2.43", "4.01", "9.86", "9.08",…
$ PctEmploy <chr> "64.55", "61.96", "69.8", "54.74", "52…
$ PctEmplManu <chr> "14.65", "12.26", "15.95", "31.22", "6…
$ PctEmplProfServ <chr> "28.82", "29.28", "21.52", "27.43", "3…
$ PctOccupManu <chr> "5.49", "6.39", "8.79", "26.76", "10.9…
$ PctOccupMgmtProf <chr> "50.73", "37.64", "32.48", "22.71", "2…
$ MalePctDivorce <chr> "3.67", "4.23", "10.1", "10.98", "7.51…
$ MalePctNevMarr <chr> "26.38", "27.99", "25.78", "28.15", "5…
$ FemalePctDiv <chr> "5.22", "6.45", "14.76", "14.47", "11.…
$ TotalPctDiv <chr> "4.47", "5.42", "12.55", "12.91", "9.7…
$ PersPerFam <chr> "3.22", "3.11", "2.95", "2.98", "2.98"…
$ PctFam2Par <chr> "91.43", "86.91", "78.54", "64.02", "5…
$ PctKids2Par <chr> "90.17", "85.33", "78.85", "62.36", "5…
$ PctYoungKids2Par <chr> "95.78", "96.82", "92.37", "65.38", "6…
$ PctTeen2Par <chr> "95.81", "86.46", "75.72", "67.43", "7…
$ PctWorkMomYoungKids <chr> "44.56", "51.14", "66.08", "59.59", "6…
$ PctWorkMom <chr> "58.88", "62.43", "74.19", "70.27", "6…
$ NumKidsBornNeverMar <dbl> 31, 43, 164, 561, 402, 1511, 263, 2368…
$ PctKidsBornNeverMar <chr> "0.36", "0.24", "0.88", "3.84", "4.7",…
$ NumImmig <dbl> 1277, 1920, 1468, 339, 196, 2091, 2637…
$ PctImmigRecent <chr> "8.69", "5.21", "16.42", "13.86", "46.…
$ PctImmigRec5 <chr> "13", "8.65", "23.98", "13.86", "56.12…
$ PctImmigRec8 <chr> "20.99", "13.33", "32.08", "15.34", "6…
$ PctImmigRec10 <chr> "30.93", "22.5", "35.63", "15.34", "69…
$ PctRecentImmig <chr> "0.93", "0.43", "0.82", "0.28", "0.82"…
$ PctRecImmig5 <chr> "1.39", "0.72", "1.2", "0.28", "0.98",…
$ PctRecImmig8 <chr> "2.24", "1.11", "1.61", "0.31", "1.18"…
$ PctRecImmig10 <chr> "3.3", "1.87", "1.78", "0.31", "1.22",…
$ PctSpeakEnglOnly <chr> "85.68", "87.79", "93.11", "94.98", "9…
$ PctNotSpeakEnglWell <chr> "1.37", "1.81", "1.14", "0.56", "0.39"…
$ PctLargHouseFam <chr> "4.81", "4.25", "2.97", "3.93", "5.23"…
$ PctLargHouseOccup <chr> "4.17", "3.34", "2.05", "2.56", "3.11"…
$ PersPerOccupHous <chr> "2.99", "2.7", "2.42", "2.37", "2.35",…
$ PersPerOwnOccHous <chr> "3", "2.83", "2.69", "2.51", "2.55", "…
$ PersPerRentOccHous <chr> "2.84", "1.96", "2.06", "2.2", "2.12",…
$ PctPersOwnOccup <chr> "91.46", "89.03", "64.18", "58.18", "5…
$ PctPersDenseHous <chr> "0.39", "1.01", "2.03", "1.21", "2.94"…
$ PctHousLess3BR <chr> "11.06", "23.6", "47.46", "45.66", "55…
$ MedNumBR <dbl> 3, 3, 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2,…
$ HousVacant <dbl> 64, 240, 544, 669, 333, 5119, 566, 205…
$ PctHousOccup <chr> "98.37", "97.15", "95.68", "91.19", "9…
$ PctHousOwnOcc <chr> "91.01", "84.88", "57.79", "54.89", "5…
$ PctVacantBoarded <chr> "3.12", "0", "0.92", "2.54", "3.9", "2…
$ PctVacMore6Mos <chr> "37.5", "18.33", "7.54", "57.85", "42.…
$ MedYrHousBuilt <dbl> 1959, 1958, 1976, 1939, 1958, 1966, 19…
$ PctHousNoPhone <chr> "0", "0.31", "1.55", "7", "7.45", "6.1…
$ PctWOFullPlumb <chr> "0.28", "0.14", "0.12", "0.87", "0.82"…
$ OwnOccLowQuart <dbl> 215900, 136300, 74700, 36400, 30600, 3…
$ OwnOccMedVal <dbl> 262600, 164200, 90400, 49600, 43200, 5…
$ OwnOccHiQuart <dbl> 326900, 199900, 112000, 66500, 59500, …
$ OwnOccQrange <dbl> 111000, 63600, 37300, 30100, 28900, 35…
$ RentLowQ <dbl> 685, 467, 370, 195, 202, 215, 463, 186…
$ RentMedian <dbl> 1001, 560, 428, 250, 283, 280, 669, 25…
$ RentHighQ <dbl> 1001, 672, 520, 309, 362, 349, 824, 32…
$ RentQrange <dbl> 316, 205, 150, 114, 160, 134, 361, 139…
$ MedRent <dbl> 1001, 627, 484, 333, 332, 340, 736, 33…
$ MedRentPctHousInc <chr> "23.8", "27.6", "24.1", "28.7", "32.2"…
$ MedOwnCostPctInc <chr> "21.1", "20.7", "21.7", "20.6", "23.2"…
$ MedOwnCostPctIncNoMtg <chr> "14", "12.5", "11.6", "14.5", "12.9", …
$ NumInShelters <dbl> 11, 0, 16, 0, 2, 327, 0, 21, 125, 43, …
$ NumStreet <dbl> 0, 0, 0, 0, 0, 4, 0, 0, 15, 4, 0, 49, …
$ PctForeignBorn <chr> "10.66", "8.3", "5", "2.04", "1.74", "…
$ PctBornSameState <chr> "53.72", "77.17", "44.77", "88.71", "7…
$ PctSameHouse85 <chr> "65.29", "71.27", "36.6", "56.7", "42.…
$ PctSameCity85 <chr> "78.09", "90.22", "61.26", "90.17", "6…
$ PctSameState85 <chr> "89.14", "96.12", "82.85", "96.24", "8…
$ LemasSwornFT <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasSwFTPerPop <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasSwFTFieldOps <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasSwFTFieldPerPop <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasTotalReq <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasTotReqPerPop <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PolicReqPerOffic <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PolicPerPop <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ RacialMatchCommPol <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PctPolicWhite <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PctPolicBlack <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PctPolicHisp <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PctPolicAsian <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PctPolicMinor <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ OfficAssgnDrugUnits <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ NumKindsDrugsSeiz <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PolicAveOTWorked <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LandArea <chr> "6.5", "10.6", "10.6", "5.2", "11.5", …
$ PopDens <chr> "1845.9", "2186.7", "2780.9", "3217.7"…
$ PctUsePubTrans <chr> "9.63", "3.84", "4.37", "3.31", "0.38"…
$ PolicCars <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PolicOperBudg <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasPctPolicOnPatr <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasGangUnitDeploy <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasPctOfficDrugUn <chr> "0", "0", "0", "0", "0", "0", "0", "0"…
$ PolicBudgPerPop <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ murders <dbl> 0, 0, 3, 0, 0, 7, 0, 8, 0, 29, 1, 12, …
$ murdPerPop <chr> "0", "0", "8.3", "0", "0", "4.63", "0"…
$ rapes <chr> "0", "1", "6", "10", "?", "77", "4", "…
$ rapesPerPop <chr> "0", "4.25", "16.6", "57.86", "?", "50…
$ robberies <chr> "1", "5", "56", "10", "4", "136", "9",…
$ robbbPerPop <chr> "8.2", "21.26", "154.95", "57.86", "32…
$ assaults <chr> "4", "24", "14", "33", "14", "449", "5…
$ assaultPerPop <chr> "32.81", "102.05", "38.74", "190.93", …
$ burglaries <chr> "14", "57", "274", "225", "91", "2094"…
$ burglPerPop <chr> "114.85", "242.37", "758.14", "1301.78…
$ larcenies <chr> "138", "376", "1797", "716", "1060", "…
$ larcPerPop <chr> "1132.08", "1598.78", "4972.19", "4142…
$ autoTheft <dbl> 16, 26, 136, 47, 91, 454, 144, 125, 20…
$ autoTheftPerPop <chr> "131.26", "110.55", "376.3", "271.93",…
$ arsons <chr> "2", "1", "22", "?", "5", "134", "17",…
$ arsonsPerPop <chr> "16.41", "4.25", "60.87", "?", "40.05"…
$ ViolentCrimesPerPop <chr> "41.02", "127.56", "218.59", "306.64",…
$ nonViolPerPop <chr> "1394.59", "1955.95", "6167.51", "?", …
| Name | crime |
| Number of rows | 2215 |
| Number of columns | 147 |
| _______________________ | |
| Column type frequency: | |
| character | 116 |
| numeric | 31 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| communityname | 0 | 1 | 7 | 28 | 0 | 2018 | 0 |
| state | 0 | 1 | 2 | 2 | 0 | 48 | 0 |
| countyCode | 0 | 1 | 1 | 3 | 0 | 115 | 0 |
| communityCode | 0 | 1 | 1 | 5 | 0 | 960 | 0 |
| householdsize | 0 | 1 | 1 | 4 | 0 | 198 | 0 |
| racepctblack | 0 | 1 | 1 | 5 | 0 | 1172 | 0 |
| racePctWhite | 0 | 1 | 2 | 5 | 0 | 1609 | 0 |
| racePctAsian | 0 | 1 | 1 | 5 | 0 | 667 | 0 |
| racePctHisp | 0 | 1 | 1 | 5 | 0 | 1026 | 0 |
| agePct12t21 | 0 | 1 | 2 | 5 | 0 | 950 | 0 |
| agePct12t29 | 0 | 1 | 2 | 5 | 0 | 1184 | 0 |
| agePct16t24 | 0 | 1 | 1 | 5 | 0 | 947 | 0 |
| agePct65up | 0 | 1 | 1 | 5 | 0 | 1221 | 0 |
| pctUrban | 0 | 1 | 1 | 5 | 0 | 293 | 0 |
| pctWWage | 0 | 1 | 2 | 5 | 0 | 1536 | 0 |
| pctWFarmSelf | 0 | 1 | 1 | 4 | 0 | 290 | 0 |
| pctWInvInc | 0 | 1 | 2 | 5 | 0 | 1774 | 0 |
| pctWSocSec | 0 | 1 | 2 | 5 | 0 | 1548 | 0 |
| pctWPubAsst | 0 | 1 | 1 | 5 | 0 | 1125 | 0 |
| pctWRetire | 0 | 1 | 2 | 5 | 0 | 1258 | 0 |
| PctPopUnderPov | 0 | 1 | 1 | 5 | 0 | 1462 | 0 |
| PctLess9thGrade | 0 | 1 | 1 | 5 | 0 | 1275 | 0 |
| PctNotHSGrad | 0 | 1 | 1 | 5 | 0 | 1690 | 0 |
| PctBSorMore | 0 | 1 | 2 | 5 | 0 | 1660 | 0 |
| PctUnemployed | 0 | 1 | 1 | 5 | 0 | 868 | 0 |
| PctEmploy | 0 | 1 | 2 | 5 | 0 | 1544 | 0 |
| PctEmplManu | 0 | 1 | 1 | 5 | 0 | 1543 | 0 |
| PctEmplProfServ | 0 | 1 | 2 | 5 | 0 | 1364 | 0 |
| PctOccupManu | 0 | 1 | 1 | 5 | 0 | 1429 | 0 |
| PctOccupMgmtProf | 0 | 1 | 1 | 5 | 0 | 1529 | 0 |
| MalePctDivorce | 0 | 1 | 1 | 5 | 0 | 948 | 0 |
| MalePctNevMarr | 0 | 1 | 2 | 5 | 0 | 1411 | 0 |
| FemalePctDiv | 0 | 1 | 1 | 5 | 0 | 1052 | 0 |
| TotalPctDiv | 0 | 1 | 1 | 5 | 0 | 994 | 0 |
| PersPerFam | 0 | 1 | 1 | 4 | 0 | 161 | 0 |
| PctFam2Par | 0 | 1 | 2 | 5 | 0 | 1664 | 0 |
| PctKids2Par | 0 | 1 | 2 | 5 | 0 | 1687 | 0 |
| PctYoungKids2Par | 0 | 1 | 2 | 5 | 0 | 1679 | 0 |
| PctTeen2Par | 0 | 1 | 2 | 5 | 0 | 1635 | 0 |
| PctWorkMomYoungKids | 0 | 1 | 2 | 5 | 0 | 1546 | 0 |
| PctWorkMom | 0 | 1 | 2 | 5 | 0 | 1433 | 0 |
| PctKidsBornNeverMar | 0 | 1 | 1 | 5 | 0 | 768 | 0 |
| PctImmigRecent | 0 | 1 | 1 | 5 | 0 | 1481 | 0 |
| PctImmigRec5 | 0 | 1 | 1 | 5 | 0 | 1658 | 0 |
| PctImmigRec8 | 0 | 1 | 1 | 5 | 0 | 1755 | 0 |
| PctImmigRec10 | 0 | 1 | 1 | 5 | 0 | 1808 | 0 |
| PctRecentImmig | 0 | 1 | 1 | 5 | 0 | 448 | 0 |
| PctRecImmig5 | 0 | 1 | 1 | 5 | 0 | 568 | 0 |
| PctRecImmig8 | 0 | 1 | 1 | 5 | 0 | 670 | 0 |
| PctRecImmig10 | 0 | 1 | 1 | 5 | 0 | 738 | 0 |
| PctSpeakEnglOnly | 0 | 1 | 2 | 5 | 0 | 1416 | 0 |
| PctNotSpeakEnglWell | 0 | 1 | 1 | 5 | 0 | 625 | 0 |
| PctLargHouseFam | 0 | 1 | 1 | 5 | 0 | 785 | 0 |
| PctLargHouseOccup | 0 | 1 | 1 | 5 | 0 | 680 | 0 |
| PersPerOccupHous | 0 | 1 | 1 | 4 | 0 | 185 | 0 |
| PersPerOwnOccHous | 0 | 1 | 1 | 4 | 0 | 185 | 0 |
| PersPerRentOccHous | 0 | 1 | 1 | 4 | 0 | 214 | 0 |
| PctPersOwnOccup | 0 | 1 | 2 | 5 | 0 | 1802 | 0 |
| PctPersDenseHous | 0 | 1 | 1 | 5 | 0 | 850 | 0 |
| PctHousLess3BR | 0 | 1 | 2 | 5 | 0 | 1755 | 0 |
| PctHousOccup | 0 | 1 | 2 | 5 | 0 | 1019 | 0 |
| PctHousOwnOcc | 0 | 1 | 2 | 5 | 0 | 1801 | 0 |
| PctVacantBoarded | 0 | 1 | 1 | 5 | 0 | 715 | 0 |
| PctVacMore6Mos | 0 | 1 | 2 | 5 | 0 | 1780 | 0 |
| PctHousNoPhone | 0 | 1 | 1 | 5 | 0 | 972 | 0 |
| PctWOFullPlumb | 0 | 1 | 1 | 4 | 0 | 181 | 0 |
| MedRentPctHousInc | 0 | 1 | 2 | 4 | 0 | 159 | 0 |
| MedOwnCostPctInc | 0 | 1 | 2 | 4 | 0 | 154 | 0 |
| MedOwnCostPctIncNoMtg | 0 | 1 | 2 | 4 | 0 | 87 | 0 |
| PctForeignBorn | 0 | 1 | 1 | 5 | 0 | 1180 | 0 |
| PctBornSameState | 0 | 1 | 2 | 5 | 0 | 1833 | 0 |
| PctSameHouse85 | 0 | 1 | 2 | 5 | 0 | 1690 | 0 |
| PctSameCity85 | 0 | 1 | 2 | 5 | 0 | 1623 | 0 |
| PctSameState85 | 0 | 1 | 2 | 5 | 0 | 1337 | 0 |
| LemasSwornFT | 0 | 1 | 1 | 5 | 0 | 221 | 0 |
| LemasSwFTPerPop | 0 | 1 | 1 | 7 | 0 | 344 | 0 |
| LemasSwFTFieldOps | 0 | 1 | 1 | 5 | 0 | 216 | 0 |
| LemasSwFTFieldPerPop | 0 | 1 | 1 | 7 | 0 | 343 | 0 |
| LemasTotalReq | 0 | 1 | 1 | 7 | 0 | 320 | 0 |
| LemasTotReqPerPop | 0 | 1 | 1 | 10 | 0 | 344 | 0 |
| PolicReqPerOffic | 0 | 1 | 1 | 6 | 0 | 337 | 0 |
| PolicPerPop | 0 | 1 | 1 | 6 | 0 | 323 | 0 |
| RacialMatchCommPol | 0 | 1 | 1 | 5 | 0 | 318 | 0 |
| PctPolicWhite | 0 | 1 | 1 | 5 | 0 | 318 | 0 |
| PctPolicBlack | 0 | 1 | 1 | 5 | 0 | 289 | 0 |
| PctPolicHisp | 0 | 1 | 1 | 5 | 0 | 224 | 0 |
| PctPolicAsian | 0 | 1 | 1 | 5 | 0 | 114 | 0 |
| PctPolicMinor | 0 | 1 | 1 | 5 | 0 | 314 | 0 |
| OfficAssgnDrugUnits | 0 | 1 | 1 | 4 | 0 | 68 | 0 |
| NumKindsDrugsSeiz | 0 | 1 | 1 | 2 | 0 | 16 | 0 |
| PolicAveOTWorked | 0 | 1 | 1 | 5 | 0 | 305 | 0 |
| LandArea | 0 | 1 | 1 | 6 | 0 | 572 | 0 |
| PopDens | 0 | 1 | 2 | 7 | 0 | 2162 | 0 |
| PctUsePubTrans | 0 | 1 | 1 | 5 | 0 | 746 | 0 |
| PolicCars | 0 | 1 | 1 | 4 | 0 | 194 | 0 |
| PolicOperBudg | 0 | 1 | 1 | 10 | 0 | 342 | 0 |
| LemasPctPolicOnPatr | 0 | 1 | 1 | 5 | 0 | 308 | 0 |
| LemasGangUnitDeploy | 0 | 1 | 1 | 2 | 0 | 4 | 0 |
| LemasPctOfficDrugUn | 0 | 1 | 1 | 5 | 0 | 266 | 0 |
| PolicBudgPerPop | 0 | 1 | 1 | 10 | 0 | 344 | 0 |
| murdPerPop | 0 | 1 | 1 | 5 | 0 | 906 | 0 |
| rapes | 0 | 1 | 1 | 4 | 0 | 172 | 0 |
| rapesPerPop | 0 | 1 | 1 | 6 | 0 | 1622 | 0 |
| robberies | 0 | 1 | 1 | 5 | 0 | 418 | 0 |
| robbbPerPop | 0 | 1 | 1 | 7 | 0 | 2061 | 0 |
| assaults | 0 | 1 | 1 | 5 | 0 | 574 | 0 |
| assaultPerPop | 0 | 1 | 1 | 7 | 0 | 2150 | 0 |
| burglaries | 0 | 1 | 1 | 5 | 0 | 910 | 0 |
| burglPerPop | 0 | 1 | 1 | 8 | 0 | 2200 | 0 |
| larcenies | 0 | 1 | 1 | 6 | 0 | 1455 | 0 |
| larcPerPop | 0 | 1 | 1 | 8 | 0 | 2212 | 0 |
| autoTheftPerPop | 0 | 1 | 1 | 7 | 0 | 2173 | 0 |
| arsons | 0 | 1 | 1 | 4 | 0 | 179 | 0 |
| arsonsPerPop | 0 | 1 | 1 | 6 | 0 | 1578 | 0 |
| ViolentCrimesPerPop | 0 | 1 | 1 | 7 | 0 | 1974 | 0 |
| nonViolPerPop | 0 | 1 | 1 | 8 | 0 | 2114 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| fold | 0 | 1 | 5.49 | 2.87 | 1 | 3.00 | 5 | 8.0 | 10 | ▇▇▇▇▇ |
| population | 0 | 1 | 53117.98 | 204620.25 | 10005 | 14366.00 | 22792 | 43024.0 | 7322564 | ▇▁▁▁▁ |
| numbUrban | 0 | 1 | 47734.72 | 205606.69 | 0 | 0.00 | 18041 | 41918.0 | 7322564 | ▇▁▁▁▁ |
| medIncome | 0 | 1 | 33984.70 | 13424.68 | 8866 | 23817.00 | 31441 | 41480.5 | 123625 | ▇▆▁▁▁ |
| medFamInc | 0 | 1 | 39857.06 | 14251.21 | 10447 | 29538.00 | 36678 | 46999.0 | 139008 | ▇▇▁▁▁ |
| perCapInc | 0 | 1 | 15603.52 | 6281.56 | 5237 | 11602.50 | 14101 | 17795.0 | 63302 | ▇▃▁▁▁ |
| whitePerCap | 0 | 1 | 16567.70 | 6346.84 | 5472 | 12610.50 | 15073 | 18609.5 | 68850 | ▇▂▁▁▁ |
| blackPerCap | 0 | 1 | 11541.75 | 9232.10 | 0 | 6742.50 | 9777 | 14526.0 | 212120 | ▇▁▁▁▁ |
| indianPerCap | 0 | 1 | 12229.19 | 14853.84 | 0 | 6345.00 | 9895 | 14757.5 | 480000 | ▇▁▁▁▁ |
| AsianPerCap | 0 | 1 | 14227.99 | 9881.27 | 0 | 8285.50 | 12250 | 17327.5 | 106165 | ▇▁▁▁▁ |
| OtherPerCap | 1 | 1 | 9442.77 | 7926.47 | 0 | 5528.25 | 8186 | 11525.5 | 137000 | ▇▁▁▁▁ |
| HispPerCap | 0 | 1 | 11019.00 | 5884.06 | 0 | 7274.00 | 9721 | 13418.0 | 54648 | ▇▅▁▁▁ |
| NumUnderPov | 0 | 1 | 7590.85 | 39361.46 | 78 | 912.50 | 2142 | 4988.0 | 1384994 | ▇▁▁▁▁ |
| NumKidsBornNeverMar | 0 | 1 | 2141.42 | 14692.58 | 0 | 147.00 | 352 | 1031.5 | 527557 | ▇▁▁▁▁ |
| NumImmig | 0 | 1 | 6277.27 | 55419.65 | 20 | 400.00 | 1024 | 3302.0 | 2082931 | ▇▁▁▁▁ |
| MedNumBR | 0 | 1 | 2.64 | 0.51 | 1 | 2.00 | 3 | 3.0 | 4 | ▁▅▁▇▁ |
| HousVacant | 0 | 1 | 1748.37 | 6503.87 | 36 | 304.50 | 558 | 1228.0 | 172768 | ▇▁▁▁▁ |
| MedYrHousBuilt | 0 | 1 | 1962.62 | 11.17 | 1939 | 1956.00 | 1964 | 1971.0 | 1987 | ▃▆▇▇▂ |
| OwnOccLowQuart | 0 | 1 | 88695.80 | 66670.78 | 14999 | 41500.00 | 65500 | 121500.0 | 500001 | ▇▂▁▁▁ |
| OwnOccMedVal | 0 | 1 | 113097.52 | 81906.36 | 19500 | 56200.00 | 82800 | 150600.0 | 500001 | ▇▃▁▁▁ |
| OwnOccHiQuart | 0 | 1 | 145318.26 | 99030.91 | 28200 | 74300.00 | 106700 | 188000.0 | 500001 | ▇▃▂▁▁ |
| OwnOccQrange | 0 | 1 | 56622.46 | 39106.50 | 0 | 32200.00 | 43400 | 65450.0 | 331000 | ▇▂▁▁▁ |
| RentLowQ | 0 | 1 | 329.97 | 144.14 | 99 | 213.50 | 307 | 421.0 | 1001 | ▇▇▃▁▁ |
| RentMedian | 0 | 1 | 428.54 | 170.71 | 120 | 289.50 | 397 | 544.0 | 1001 | ▆▇▅▂▁ |
| RentHighQ | 0 | 1 | 527.25 | 199.29 | 182 | 366.00 | 486 | 659.5 | 1001 | ▅▇▆▃▂ |
| RentQrange | 0 | 1 | 197.29 | 85.21 | 0 | 139.00 | 171 | 232.5 | 803 | ▇▇▂▁▁ |
| MedRent | 0 | 1 | 501.47 | 169.27 | 192 | 364.00 | 467 | 615.0 | 1001 | ▅▇▅▂▁ |
| NumInShelters | 0 | 1 | 66.95 | 564.25 | 0 | 0.00 | 0 | 22.0 | 23383 | ▇▁▁▁▁ |
| NumStreet | 0 | 1 | 17.82 | 245.45 | 0 | 0.00 | 0 | 1.0 | 10447 | ▇▁▁▁▁ |
| murders | 0 | 1 | 7.76 | 58.17 | 0 | 0.00 | 1 | 3.0 | 1946 | ▇▁▁▁▁ |
| autoTheft | 3 | 1 | 516.69 | 3258.16 | 1 | 30.00 | 75 | 232.5 | 112464 | ▇▁▁▁▁ |
After this first look at the dataset, the first phase of the SEMMA methodology is the cleaning of our data. In this section we will check for coding problems in the dataset. It can already be seen that 116 columns are classified as qualitative, even though the dataset contains only 4 truly qualitative variables, all of which refer to geography (communityname, state, countyCode, communityCode). This misclassification is most likely caused by the "?" entries in the dataset, which should be treated as nulls. We will therefore replace them and then convert the columns' formats. Furthermore, it can be observed that multiple columns could function as target variables, such as autoTheft, rapesPerPop, nonViolPerPop, etc. For this research, and as described in the introduction, we will focus on only two, namely ViolentCrimesPerPop and murdPerPop. On the basis of these two we will create new target variables; the remaining candidates present in the dataset can therefore be excluded.
Let's filter out the qualitative columns and replace "?" with NA.
if (!requireNamespace("dplyr", quietly = TRUE)) {
install.packages("dplyr")
}
library(dplyr)
qual_var <- crime %>% select_if(is.character)
qual_var <- qual_var %>% mutate_all(~na_if(.,"?"))
qual_var
# A tibble: 2,215 × 116
communityname state countyCode communityCode householdsize
<chr> <chr> <chr> <chr> <chr>
1 BerkeleyHeightstownsh… NJ 39 5320 3.1
2 Marpletownship PA 45 47616 2.82
3 Tigardcity OR <NA> <NA> 2.43
4 Gloversvillecity NY 35 29443 2.4
5 Bemidjicity MN 7 5068 2.76
6 Springfieldcity MO <NA> <NA> 2.45
7 Norwoodtown MA 21 50250 2.6
8 Andersoncity IN <NA> <NA> 2.45
9 Fargocity ND 17 25700 2.46
10 Wacocity TX <NA> <NA> 2.62
# ℹ 2,205 more rows
# ℹ 111 more variables: racepctblack <chr>, racePctWhite <chr>,
# racePctAsian <chr>, racePctHisp <chr>, agePct12t21 <chr>,
# agePct12t29 <chr>, agePct16t24 <chr>, agePct65up <chr>,
# pctUrban <chr>, pctWWage <chr>, pctWFarmSelf <chr>,
# pctWInvInc <chr>, pctWSocSec <chr>, pctWPubAsst <chr>,
# pctWRetire <chr>, PctPopUnderPov <chr>, PctLess9thGrade <chr>, …
Let's exclude communityname, state, countyCode and communityCode from the new dataset, convert the remaining columns into numeric values, and then put the qualitative variables back into the dataset, so that the original dataset "crime" keeps all variables.
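A minimal sketch of that conversion step (the helper vector `id_cols` is introduced here; the column names match the dataset):

```r
library(dplyr)

# The four community identifiers stay as character columns; every
# other column is converted to numeric after replacing "?" with NA.
id_cols <- c("communityname", "state", "countyCode", "communityCode")

crime <- crime %>%
  mutate(across(-all_of(id_cols),
                ~ as.numeric(na_if(as.character(.x), "?"))))
```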
Now that all columns are numeric, let's exclude the target variables we won't use, exclude once again the four qualitative variables that serve as community identifiers, and save the result into a new dataset called "crime2", leaving "crime" untouched.
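This step can be sketched as follows (a sketch; the set of dropped crime columns is taken from the variable listing above, keeping only murdPerPop and ViolentCrimesPerPop as candidate targets):

```r
library(dplyr)

# Drop the unused crime counts/rates and the four community
# identifiers; murdPerPop and ViolentCrimesPerPop remain.
crime2 <- crime %>%
  select(-c(communityname, state, countyCode, communityCode,
            murders, rapes, rapesPerPop, robberies, robbbPerPop,
            assaults, assaultPerPop, burglaries, burglPerPop,
            larcenies, larcPerPop, autoTheft, autoTheftPerPop,
            arsons, arsonsPerPop, nonViolPerPop))
```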
Since the null values are now coded correctly, it is useful to investigate whether they make up a large share of the data. Therefore, we will check which variables have many null values.
ausentes <-
apply(crime2, 2, function(x) sum(is.na(x)))
ausentes_tb <-
tibble(Variable = names(crime2), Ausentes = ausentes) |>
filter(Ausentes > 0)
ausentes_tb
# A tibble: 24 × 2
Variable Ausentes
<chr> <int>
1 LemasSwornFT 1872
2 LemasSwFTPerPop 1872
3 LemasSwFTFieldOps 1872
4 LemasSwFTFieldPerPop 1872
5 LemasTotalReq 1872
6 LemasTotReqPerPop 1872
7 PolicReqPerOffic 1872
8 PolicPerPop 1872
9 RacialMatchCommPol 1872
10 PctPolicWhite 1872
# ℹ 14 more rows
| Variable | NAs |
|---|---|
| LemasSwornFT | 1872 |
| LemasSwFTPerPop | 1872 |
| LemasSwFTFieldOps | 1872 |
| LemasSwFTFieldPerPop | 1872 |
| LemasTotalReq | 1872 |
| LemasTotReqPerPop | 1872 |
| PolicReqPerOffic | 1872 |
| PolicPerPop | 1872 |
| RacialMatchCommPol | 1872 |
| PctPolicWhite | 1872 |
| PctPolicBlack | 1872 |
| PctPolicHisp | 1872 |
| PctPolicAsian | 1872 |
| PctPolicMinor | 1872 |
| OfficAssgnDrugUnits | 1872 |
| NumKindsDrugsSeiz | 1872 |
| PolicAveOTWorked | 1872 |
| PolicCars | 1872 |
| PolicOperBudg | 1872 |
| LemasPctPolicOnPatr | 1872 |
| LemasGangUnitDeploy | 1872 |
| PolicBudgPerPop | 1872 |
| ViolentCrimesPerPop | 221 |
| OtherPerCap | 1 |
It can be seen that there are in total 24 columns with NAs, most of which are missing a significant portion of their values: with 2,215 observations per variable, the missing entries amount to roughly 85% of the values in the affected columns. Almost all of these variables refer to police-related factors. It is also important to note that the violence target variable has several NAs, which is not ideal; however, they make up only about 10% of its values rather than 85%. The general practice in the SEMMA methodology would be to impute the median, mean or mode, but with such a high share of missing values, the imputed values would probably not be representative. On the other hand, testing for collinearity is not possible with this many NAs. A good solution is therefore to impute for now and, when checking for collinearity, always prefer to exclude the variable with the higher number of NAs whenever collinearity is high. We will impute the median, since these variables are quantitative and continuous.
The same method will be used for the violence target variable. It could be argued that a good solution would be to exclude the records with NAs, but this would reduce the sample size by 10%, and considering that the number of observations is barely above the limit needed to train a robust model, each data point is of utmost importance and could contribute significant and unique information. In addition, there could be a pattern behind the empty values, meaning that certain values of the independent variables are more likely to be missing; deleting those records could eliminate specific patterns in the data. We are working with real-life data obtained from a reliable authority (the FBI), so it can be assumed that data for such an important variable is not missing at random. In order to maintain the diversity and the original sample size, the empty values in ViolentCrimesPerPop won't be deleted but imputed with the median, which accounts for skewness and outliers.
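The imputation described in the two paragraphs above can be sketched as a single step (a sketch; it fills every numeric column, including ViolentCrimesPerPop, with that column's own median):

```r
library(dplyr)

# Replace every NA in a numeric column with that column's median.
crime2 <- crime2 %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))
```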
Apart from nulls, outliers also have to be reviewed.
Furthermore, after correcting the dataset for outliers and NAs, we will create the binary target variables.
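A sketch of the median split described above (the binary column names `murd_bin` and `viol_bin` are introduced here for illustration, not taken from the dataset):

```r
library(dplyr)

# Median split: communities above the median rate are coded "High",
# the rest "Low". Applied to both candidate targets.
crime2 <- crime2 %>%
  mutate(
    murd_bin = factor(ifelse(murdPerPop > median(murdPerPop, na.rm = TRUE),
                             "High", "Low")),
    viol_bin = factor(ifelse(ViolentCrimesPerPop >
                               median(ViolentCrimesPerPop, na.rm = TRUE),
                             "High", "Low"))
  )
```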
For the following sections, the variables will be divided into the categories identified previously in this research (Economic State/Wealth, Demographic Characteristics, Police-Related Factors and Family Structure) in order to gain a clearer overview. At the end of this section, two datasets containing the variables identified as most important will be created, one for each target variable (Violence and Murder).
Variables:
medIncome, medFamInc, perCapInc, whitePerCap, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, HispPerCap, pctWWage, pctWFarmSelf, pctWInvInc, pctWSocSec, pctWPubAsst, pctWRetire, OwnOccLowQuart, OwnOccMedVal, OwnOccHiQuart, OwnOccQrange, RentLowQ, RentMedian, RentHighQ, RentQrange, MedRent, MedRentPctHousInc, MedOwnCostPctInc, MedOwnCostPctIncNoMtg
box1 <-
ggplot(crime2, aes(murdPerPop, medIncome)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box2 <-
ggplot(crime2, aes(murdPerPop, medFamInc)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box3 <-
ggplot(crime2, aes(murdPerPop, perCapInc)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box4 <-
ggplot(crime2, aes(murdPerPop, whitePerCap)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box5 <-
ggplot(crime2, aes(murdPerPop, blackPerCap)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box6 <-
ggplot(crime2, aes(murdPerPop, indianPerCap)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box7 <-
ggplot(crime2, aes(murdPerPop, AsianPerCap)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box8 <-
ggplot(crime2, aes(murdPerPop, OtherPerCap)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box9 <-
ggplot(crime2, aes(murdPerPop, HispPerCap)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box10 <-
ggplot(crime2, aes(murdPerPop, pctWWage)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box11 <-
ggplot(crime2, aes(murdPerPop, pctWFarmSelf)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box13 <-
ggplot(crime2, aes(murdPerPop, pctWInvInc)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box14 <-
ggplot(crime2, aes(murdPerPop, pctWSocSec)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box15 <-
ggplot(crime2, aes(murdPerPop, pctWPubAsst)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box16 <-
ggplot(crime2, aes(murdPerPop, pctWRetire)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box17 <-
ggplot(crime2, aes(murdPerPop, OwnOccLowQuart)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box18 <-
ggplot(crime2, aes(murdPerPop, OwnOccMedVal)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box19 <-
ggplot(crime2, aes(murdPerPop, OwnOccHiQuart)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box20 <-
ggplot(crime2, aes(murdPerPop, OwnOccQrange)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box21 <-
ggplot(crime2, aes(murdPerPop, RentLowQ)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box22 <-
ggplot(crime2, aes(murdPerPop, RentMedian)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box23 <-
ggplot(crime2, aes(murdPerPop, RentHighQ)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box24 <-
ggplot(crime2, aes(murdPerPop, RentQrange)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box25 <-
ggplot(crime2, aes(murdPerPop, MedRent)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box26 <-
ggplot(crime2, aes(murdPerPop, MedRentPctHousInc)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box27 <-
ggplot(crime2, aes(murdPerPop, MedOwnCostPctInc)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box28 <-
ggplot(crime2, aes(murdPerPop, MedOwnCostPctIncNoMtg)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
multiplot(box1, box2, box3, box4, box5, box6, box7, box8, box9, box10, cols = 2)
multiplot(box11, box13, box14, box15, box16, box17, box18, box19, box20,cols = 2)
multiplot( box21, box22, box23, box24, box25, box26, box27, box28, cols = 2)
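The repeated blocks above can also be generated programmatically; a compact sketch producing the same plots (`econ_vars` is a helper vector introduced here and shortened for illustration):

```r
library(ggplot2)
library(Rmisc)  # for multiplot()

# One flipped boxplot of murdPerPop against each listed variable.
econ_vars <- c("medIncome", "medFamInc", "perCapInc", "pctWWage")

boxes <- lapply(econ_vars, function(v) {
  ggplot(crime2, aes(murdPerPop, .data[[v]])) +
    geom_boxplot() +
    theme_minimal() +
    coord_flip()
})
multiplot(plotlist = boxes, cols = 2)
```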
Looking at these box plots, all the variables referring to economic state are asymmetric, so outliers are detected and, just like the missing values, will be imputed with the median. The only three variables with no outliers, or at least very few, are pctWInvInc, RentHighQ and MedRent.
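The outlier treatment described here can be sketched with a small helper (`cap_outliers` is a name introduced for this sketch; it applies the usual 1.5 × IQR rule):

```r
# Replace values beyond 1.5 * IQR of the quartiles with the median.
cap_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  out <- which(x < q[1] - fence | x > q[2] + fence)
  x[out] <- median(x, na.rm = TRUE)
  x
}

crime2$medIncome <- cap_outliers(crime2$medIncome)
```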
Variables:
population, householdsize, racepctblack, racePctWhite, racePctAsian, racePctHisp, agePct12t21, agePct12t29, agePct16t24, agePct65up, numbUrban, pctUrban, PctImmigRecent, PctImmigRec5, PctImmigRec8, PctImmigRec10, PctRecentImmig, PctRecImmig5, PctRecImmig8, PctRecImmig10, PctSpeakEnglOnly, PctNotSpeakEnglWell, PctForeignBorn, PctBornSameState, PctSameHouse85, PctSameCity85, PctSameState85
box1 <-
ggplot(crime2, aes(murdPerPop, population)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box2 <-
ggplot(crime2, aes(murdPerPop, householdsize)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box3 <-
ggplot(crime2, aes(murdPerPop, racepctblack)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box4 <-
ggplot(crime2, aes(murdPerPop, racePctAsian)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box5 <-
ggplot(crime2, aes(murdPerPop, racePctHisp)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box6 <-
ggplot(crime2, aes(murdPerPop, agePct12t21)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box7 <-
ggplot(crime2, aes(murdPerPop, agePct12t29)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box8 <-
ggplot(crime2, aes(murdPerPop, agePct16t24)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box9 <-
ggplot(crime2, aes(murdPerPop, agePct65up)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box10 <-
ggplot(crime2, aes(murdPerPop, numbUrban)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box11 <-
ggplot(crime2, aes(murdPerPop, pctUrban)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box13 <-
ggplot(crime2, aes(murdPerPop, PctImmigRecent)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box14 <-
ggplot(crime2, aes(murdPerPop, PctImmigRec5)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box15 <-
ggplot(crime2, aes(murdPerPop, PctImmigRec8)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box16 <-
ggplot(crime2, aes(murdPerPop, PctImmigRec10)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box17 <-
ggplot(crime2, aes(murdPerPop, PctRecentImmig)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box18 <-
ggplot(crime2, aes(murdPerPop, PctRecImmig5)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box19 <-
ggplot(crime2, aes(murdPerPop, PctRecImmig8)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box20 <-
ggplot(crime2, aes(murdPerPop, PctRecImmig10)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box21 <-
ggplot(crime2, aes(murdPerPop, PctSpeakEnglOnly)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box22 <-
ggplot(crime2, aes(murdPerPop, PctNotSpeakEnglWell)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box23 <-
ggplot(crime2, aes(murdPerPop, PctForeignBorn)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box24 <-
ggplot(crime2, aes(murdPerPop, PctBornSameState)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box25 <-
ggplot(crime2, aes(murdPerPop, PctSameHouse85)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box26 <-
ggplot(crime2, aes(murdPerPop, PctSameCity85)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box27 <-
ggplot(crime2, aes(murdPerPop, PctSameState85)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box28 <-
ggplot(crime2, aes(murdPerPop, racePctWhite)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
multiplot(box1, box2, box3, box4, box5, box6, box7, box8, box9, box10, cols = 2)
multiplot(box11, box13, box14, box15, box16, box17, box18, box19, box20,cols = 2)
multiplot( box21, box22, box23, box24, box25, box26, box27, box28, cols = 2)
Once again, almost all variables have outliers. pctUrban, instead of outliers, has a very high number of values at the extremes of its distribution. The median sits at the right extreme, which indicates that the distribution is heavily skewed, since many values are concentrated there. In general, most variables are strongly skewed to the left or right and have many outliers.
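The visual impression of skewness can also be quantified. As a rough sketch (this helper is not part of the original analysis), the moment-based sample skewness is positive for right-skewed variables and negative for left-skewed ones:

```r
# Hypothetical helper: moment-based sample skewness
# (positive = right-skewed, negative = left-skewed, ~0 = symmetric)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2)^(3/2))
}
```

Applied to a variable such as crime2$population, it would be expected to return a strongly positive value, matching the long right tails seen in the box plots.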
box1 <-
ggplot(crime2, aes(murdPerPop, LemasSwornFT)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box2 <-
ggplot(crime2, aes(murdPerPop, LemasSwFTPerPop)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box3 <-
ggplot(crime2, aes(murdPerPop, LemasSwFTFieldOps)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box4 <-
ggplot(crime2, aes(murdPerPop, LemasSwFTFieldPerPop)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box5 <-
ggplot(crime2, aes(murdPerPop, LemasTotalReq)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box6 <-
ggplot(crime2, aes(murdPerPop, LemasTotReqPerPop)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box7 <-
ggplot(crime2, aes(murdPerPop, PolicReqPerOffic)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box8 <-
ggplot(crime2, aes(murdPerPop, PolicPerPop)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box9 <-
ggplot(crime2, aes(murdPerPop, RacialMatchCommPol)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box10 <-
ggplot(crime2, aes(murdPerPop, PctPolicWhite)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box11 <-
ggplot(crime2, aes(murdPerPop, PctPolicBlack)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box13 <-
ggplot(crime2, aes(murdPerPop, PctPolicHisp)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box14 <-
ggplot(crime2, aes(murdPerPop, PctPolicAsian)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box15 <-
ggplot(crime2, aes(murdPerPop, PctPolicMinor)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box16 <-
ggplot(crime2, aes(murdPerPop, OfficAssgnDrugUnits)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box17 <-
ggplot(crime2, aes(murdPerPop, NumKindsDrugsSeiz)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box18 <-
ggplot(crime2, aes(murdPerPop, PolicAveOTWorked)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box19 <-
ggplot(crime2, aes(murdPerPop, LandArea)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box20 <-
ggplot(crime2, aes(murdPerPop, PopDens)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box21 <-
ggplot(crime2, aes(murdPerPop, PctUsePubTrans)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box22 <-
ggplot(crime2, aes(murdPerPop, PolicCars)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box23 <-
ggplot(crime2, aes(murdPerPop, PolicOperBudg)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box24 <-
ggplot(crime2, aes(murdPerPop, LemasPctPolicOnPatr)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box25 <-
ggplot(crime2, aes(murdPerPop, LemasGangUnitDeploy)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box26 <-
ggplot(crime2, aes(murdPerPop, LemasPctOfficDrugUn)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box27 <-
ggplot(crime2, aes(murdPerPop, PolicBudgPerPop)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
multiplot(box1, box2, box3, box4, box5, box6, box7, box8, box9, box10, cols = 2)
multiplot(box11, box13, box14, box15, box16, box17, box18, box19, box20,cols = 2)
multiplot( box21, box22, box23, box24, box25, box26, box27, cols = 2)
Once again, outliers are present in almost all of the variables. The only one without outliers is LemasGangUnitDeploy, which nevertheless has the most interesting distribution, since its variability is exceptionally high.
Variables:
MalePctDivorce, MalePctNevMarr, FemalePctDiv, TotalPctDiv, PersPerFam, PctFam2Par, PctKids2Par, PctYoungKids2Par, PctTeen2Par, PctWorkMomYoungKids, PctWorkMom, NumKidsBornNeverMar, PctKidsBornNeverMar
box1 <-
ggplot(crime2, aes(murdPerPop, MalePctDivorce)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box2 <-
ggplot(crime2, aes(murdPerPop, MalePctNevMarr)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box3 <-
ggplot(crime2, aes(murdPerPop, FemalePctDiv)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box4 <-
ggplot(crime2, aes(murdPerPop, TotalPctDiv)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box5 <-
ggplot(crime2, aes(murdPerPop, PersPerFam)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box6 <-
ggplot(crime2, aes(murdPerPop, PctFam2Par)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box7 <-
ggplot(crime2, aes(murdPerPop, PctKids2Par)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box8 <-
ggplot(crime2, aes(murdPerPop, PctYoungKids2Par)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box9 <-
ggplot(crime2, aes(murdPerPop, PctTeen2Par)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box10 <-
ggplot(crime2, aes(murdPerPop, PctWorkMomYoungKids)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box11 <-
ggplot(crime2, aes(murdPerPop, PctWorkMom)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box13 <-
ggplot(crime2, aes(murdPerPop, NumKidsBornNeverMar)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
box14 <-
ggplot(crime2, aes(murdPerPop, PctKidsBornNeverMar)) +
geom_boxplot() +
theme_minimal() +
coord_flip()
multiplot(box1, box2, box3, box4, box5, box6, box7, cols = 2)
multiplot( box8, box9, box10, box11, box13, box14, cols = 2)
It can be observed that the only variables that seem to be evenly distributed are TotalPctDiv and FemalePctDiv, as their median is centered and their whiskers have similar lengths. However, they still have outliers, although considerably fewer than the remaining variables.
Now that we have identified NAs and tested for outliers, we have concluded that almost all variables contain outliers, so these must be corrected. For the outliers, we will flag values whose absolute z-score exceeds 2.5 and then impute them, as mentioned above, with the median. In the same step, the NAs present in the variables will be fixed as well.
crime2 <-
crime2 |>
# Flag outliers (|z| > 2.5) as NA in every numeric column; the z-score is
# computed with na.rm so columns that already contain NAs are handled safely
dplyr::mutate(across(where(is.numeric), function(x) {
z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
ifelse(!is.na(z) & abs(z) > 2.5, NA, x)
})) |>
# Impute numeric NAs with the median
mutate_if(is.numeric, ~ifelse(is.na(.), median(., na.rm = TRUE), .)) |>
# Impute character NAs with the most frequent value (base mode() returns the
# storage mode, not the statistical mode, so it cannot be used here)
mutate_if(is.character, ~ifelse(is.na(.), names(which.max(table(.))), .))
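A quick sanity check of the median-imputation pattern, shown here on a toy data frame standing in for crime2 (not part of the original code):

```r
# Toy data frame with missing values in both columns
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Same pattern as above: replace numeric NAs with the column median
df[] <- lapply(df, function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x))

# After imputation, no NAs should remain in any column
stopifnot(all(colSums(is.na(df)) == 0))
```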
Let’s focus on the target variables.
We must create the new binary target variables, and for this it is crucial to decide whether to base the split on the mean, median or mode of the values. To that end, the distributions of ViolentCrimesPerPop and murdPerPop must be reviewed.
hist(crime2$ViolentCrimesPerPop, main="Distribution Violence", col="red")
hist(crime2$murdPerPop, main="Distribution Murder", col="red")
We see that both are highly skewed to the right, so splitting at the median is the best choice to keep the binary target variables as balanced as possible.
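To illustrate why (a toy simulation, not part of the analysis): in a right-skewed sample the mean is pulled above the median, so splitting at the mean produces more imbalanced classes than splitting at the median:

```r
set.seed(1)
x <- rexp(10000, rate = 1)   # right-skewed toy data

mean(x) > median(x)          # TRUE: the mean exceeds the median
mean(x > median(x))          # share above the median: ~0.5
mean(x > mean(x))            # share above the mean: noticeably below 0.5
```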
crime2_v <- crime2 %>% select(-c(ViolentCrimesPerPop, murdPerPop))
crime2_m <- crime2 %>% select(-c(ViolentCrimesPerPop, murdPerPop))
median_v <- median(crime2$ViolentCrimesPerPop)
crime2_v$violence <- ifelse(crime2$ViolentCrimesPerPop > median_v, 1, 0)
median_m <- median(crime2$murdPerPop)
crime2_m$murder <- ifelse(crime2$murdPerPop > median_m, 1, 0)
crime2 <- crime2 %>% select(-c(ViolentCrimesPerPop, murdPerPop))
variable <- crime2_v %>% select(violence)
variable2 <- crime2_m %>% select(murder)
crime2 <- cbind(crime2, variable, variable2)
##crime2_v is dataset with all variables plus violence target variable
## crime2_m is dataset with all variables plus murder target variable
## crime2 is dataset with all variables plus both target variables
We created the two datasets for the two target variables: crime2_v and crime2_m.
Let's confirm that the class balance of the two target variables is acceptable and, at the same time, find out which is the minority class and which the majority class. This way we will also know which value serves as the accuracy base and which as the error-rate base.
Violence:
set.seed(123)
# violence
table_hd <- table(crime2_v$violence)
base_error_rate <- round((table_hd[2]/sum(table_hd))*100, 2)
cat("Reference base Error rate:", base_error_rate, "%\n")
Reference base Error rate: 45.01 %
base_accuracy <- round((table_hd[1]/sum(table_hd))*100, 2)
cat("Base accuracy:", base_accuracy, "%\n")
Base accuracy: 54.99 %
sum(crime2_v$violence == 0)
[1] 1218
sum(crime2_v$violence == 1)
[1] 997
| Violence | Accuracy Base | Error Rate Base |
|---|---|---|
| 0 | 54.99 % | - |
| 1 | - | 45.01 % |
Murder:
# murder
table_hd <- table(crime2_m$murder)
base_error_rate <- round((table_hd[2]/sum(table_hd))*100, 2)
cat("Reference base Error rate:", base_error_rate, "%\n")
Reference base Error rate: 34.36 %
base_accuracy <- round((table_hd[1]/sum(table_hd))*100, 2)
cat("Base accuracy:", base_accuracy, "%\n")
Base accuracy: 65.64 %
sum(crime2_m$murder == 0)
[1] 1454
sum(crime2_m$murder == 1)
[1] 761
| Murder | Accuracy Base | Error Rate Base |
|---|---|---|
| 0 | 65.64% | - |
| 1 | - | 34.36% |
The class balance for murder is worse, so oversampling has to be done.
library(ROSE)
# Perform random oversampling
set.seed(1234)
crime2_m_balanced <- ovun.sample(murder ~ ., data = crime2_m, method = "over", N = 2 * table(crime2_m$murder)[1])$data
table(crime2_m_balanced$murder)
0 1
1454 1454
# murder
table_hd <- table(crime2_m_balanced$murder)
base_error_rate <- round((table_hd[2]/sum(table_hd))*100, 2)
cat("Reference base Error rate:", base_error_rate, "%\n")
Reference base Error rate: 50 %
base_accuracy <- round((table_hd[1]/sum(table_hd))*100, 2)
cat("Base accuracy:", base_accuracy, "%\n")
Base accuracy: 50 %
sum(crime2_m_balanced$murder == 0)
[1] 1454
sum(crime2_m_balanced$murder == 1)
[1] 1454
crime2_m <- crime2_m_balanced
It can be seen that the classes of both target variables are now well balanced.
Now that we have fixed outliers and NAs and have our target variables ready, we will check for collinearity to see which variables can be excluded, and we will also examine how the target variable is represented within the independent variables. As in the outliers section, we will divide this part into the different variable groups.
For collinearity, an absolute correlation of 0.5 or above can be considered high, and at least one variable of each affected pair must be excluded.
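As a small helper (not part of the original code), the offending pairs can be listed programmatically once a correlation matrix such as the ones below has been computed:

```r
# Hypothetical helper: list variable pairs whose absolute correlation
# reaches the cutoff, excluding the diagonal and duplicate pairs
high_pairs <- function(cm, cutoff = 0.5) {
  idx <- which(abs(cm) >= cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             r    = cm[idx])
}
```

Calling, e.g., high_pairs(cor_matrix) would return one row per highly correlated pair, which makes the exclusion decisions easier to document.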
cor_matrix <-
crime2 |>
dplyr::select(medIncome, medFamInc, perCapInc, whitePerCap, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, HispPerCap, pctWWage, pctWFarmSelf, pctWInvInc, pctWSocSec, pctWPubAsst, pctWRetire, OwnOccLowQuart, OwnOccMedVal, OwnOccHiQuart, OwnOccQrange, RentLowQ, RentMedian, RentHighQ, RentQrange, MedRent, MedRentPctHousInc, MedOwnCostPctInc, MedOwnCostPctIncNoMtg) |>
cor() |>
round(2)
corrplot::corrplot(cor_matrix, method = "number", tl.cex = 0.5, number.cex = 0.4, type = "lower")
Let's filter for absolute correlations of 0.5 or above.
filtered_matrix <- abs(cor_matrix)
filtered_matrix[abs(cor_matrix) < 0.5] <- NA
corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")
We will delete: medIncome, medFamInc, whitePerCap, pctWFarmSelf, pctWInvInc, pctWSocSec, pctWPubAsst, pctWRetire, OwnOccLowQuart, OwnOccMedVal, RentLowQ, RentMedian, RentHighQ, RentQrange, MedRent, HispPerCap, OwnOccHiQuart, OwnOccQrange.
Let's check that no collinearity of 0.5 or above remains:
cor_matrix <-
crime2 |>
dplyr::select(perCapInc, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, pctWWage, MedRentPctHousInc, MedOwnCostPctInc, MedOwnCostPctIncNoMtg) |>
cor() |>
round(2)
filtered_matrix <- abs(cor_matrix)
filtered_matrix[abs(cor_matrix) < 0.5] <- NA
corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")
Keeping only the selected economic variables in the dataset (a full backup is saved first in crime2_m_b):
crime2_m_b <- crime2_m
crime2_m <- crime2_m |> select(perCapInc, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, pctWWage, MedRentPctHousInc, MedOwnCostPctInc, MedOwnCostPctIncNoMtg, murder)
library(ggtext)
a1 <- crime2_m |> ggplot(aes(x = murder, y = perCapInc, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in perCapInc")
a2 <- crime2_m |> ggplot(aes(x = murder, y = blackPerCap, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in blackPerCap")
a3 <- crime2_m |> ggplot(aes(x = murder, y = indianPerCap, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in indianPerCap")
a4 <- crime2_m |> ggplot(aes(x = murder, y = AsianPerCap, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in AsianPerCap")
a5 <- crime2_m |> ggplot(aes(x = murder, y = OtherPerCap, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in OtherPerCap")
a6 <- crime2_m |> ggplot(aes(x = murder, y = pctWWage, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in pctWWage")
a7 <- crime2_m |> ggplot(aes(x = murder, y = MedRentPctHousInc, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in MedRentPctHousInc")
a8 <- crime2_m |> ggplot(aes(x = murder, y = MedOwnCostPctInc, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in MedOwnCostPctInc")
a9 <- crime2_m |> ggplot(aes(x = murder, y = MedOwnCostPctIncNoMtg, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in MedOwnCostPctIncNoMtg")
multiplot(a1, a2, a3, a4 , cols = 2)
multiplot(a5, a6, a7, a8, a9, cols = 2)
We can see that for each variable the two murder classes differ only slightly in distribution, since the box plots for 0 and 1 are highly similar. However, for perCapInc the variance is higher for the communities labelled 0, and more outliers are present. The same holds for AsianPerCap and pctWWage.
cor_matrix <-
crime2 |>
dplyr::select(population, householdsize, racepctblack, racePctWhite, racePctAsian, racePctHisp, agePct12t21, agePct12t29, agePct16t24, agePct65up, numbUrban, pctUrban, PctImmigRecent, PctImmigRec5, PctImmigRec8, PctImmigRec10, PctRecentImmig, PctRecImmig5, PctRecImmig8, PctRecImmig10, PctSpeakEnglOnly, PctNotSpeakEnglWell, PctForeignBorn, PctBornSameState, PctSameHouse85, PctSameCity85, PctSameState85) |>
cor() |>
round(2)
corrplot::corrplot(cor_matrix, method = "number", tl.cex = 0.5, number.cex = 0.4, type = "lower")
filtered_matrix <- abs(cor_matrix)
filtered_matrix[abs(cor_matrix) < 0.5] <- NA
corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")
Variables to delete: agePct12t21, racePctAsian, numbUrban, agePct16t24, PctRecImmig10, PctRecImmig5, PctImmigRec5, PctImmigRec10, PctNotSpeakEnglWell, PctSameCity85, PctSameState85, racePctHisp, PctImmigRec8, PctRecentImmig, PctRecImmig8
In general, we delete the variables that record absolute counts rather than percentages (e.g. we delete numbUrban and keep pctUrban). Among the strongly correlated immigration measures (PctImmigRec5/8/10 and PctRecImmig5/8/10), only PctImmigRecent and PctForeignBorn are retained, as they summarize the same information.
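As a cross-check (not used for the selection above), caret's findCorrelation() can automate this kind of pruning; note that its greedy suggestions may differ from the manual choices here, which deliberately favour percentage variables over absolute counts:

```r
# Hypothetical cross-check: greedily flag columns whose removal brings all
# pairwise |r| below the 0.5 cutoff (caret is already loaded above)
library(caret)
to_drop <- findCorrelation(cor_matrix, cutoff = 0.5, names = TRUE)
to_drop   # candidate columns to remove
```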
cor_matrix <-
crime2 |>
dplyr::select(population, racepctblack, racePctWhite, agePct12t29, PctImmigRecent, PctSpeakEnglOnly, PctForeignBorn, PctBornSameState, PctSameHouse85) |>
cor() |>
round(2)
filtered_matrix <- abs(cor_matrix)
filtered_matrix[abs(cor_matrix) < 0.5] <- NA
corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")
Adding the chosen variables to dataset:
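The corresponding code chunk appears to be missing here; a sketch of what it presumably looks like, following the pattern used for the other variable groups (crime2_m_b is the backup holding all columns):

```r
# Presumed reconstruction: pull the retained demographic variables from the
# full backup and bind them to the reduced murder data set
new_var <- crime2_m_b |> select(population, racepctblack, racePctWhite, agePct12t29, PctImmigRecent, PctSpeakEnglOnly, PctForeignBorn, PctBornSameState, PctSameHouse85)
crime2_m <- cbind(new_var, crime2_m)
```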
a1 <- crime2_m |> ggplot(aes(x = murder, y = population, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in population")
a2 <- crime2_m |> ggplot(aes(x = murder, y = racepctblack, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in racepctblack")
a3 <- crime2_m |> ggplot(aes(x = murder, y = racePctWhite, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in racePctWhite")
a4 <- crime2_m |> ggplot(aes(x = murder, y = agePct12t29, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in agePct12t29")
a5 <- crime2_m |> ggplot(aes(x = murder, y = PctImmigRecent, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PctImmigRecent")
a6 <- crime2_m |> ggplot(aes(x = murder, y = PctSpeakEnglOnly, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PctSpeakEnglOnly")
a7 <- crime2_m |> ggplot(aes(x = murder, y = PctForeignBorn, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PctForeignBorn")
a8 <- crime2_m |> ggplot(aes(x = murder, y = PctBornSameState, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PctBornSameState")
a9 <- crime2_m |> ggplot(aes(x = murder, y = PctSameHouse85, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PctSameHouse85")
multiplot(a1, a2, a3, a4 , cols = 2)
multiplot(a5, a6, a7, a8, a9, cols = 2)
Looking at the box plots, in most cases there is a difference in median, variance and outliers between the two classes of the target variable. For racepctblack, for example, the higher median for class 1 indicates that communities with a higher share of African Americans tend to have above-average murder rates. The opposite holds for racePctWhite where, despite the values being more clustered, the median percentage of Caucasians in communities with below-average murder rates is close to 100.
As seen before, essentially all NA values in the original dataset were present in the variables addressing police-related factors.
cor_matrix <-
crime2 |>
dplyr::select(LemasSwornFT, LemasSwFTPerPop, LemasSwFTFieldOps, LemasSwFTFieldPerPop, LemasTotalReq, LemasTotReqPerPop, PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicBlack, PctPolicHisp, PctPolicAsian, PctPolicMinor, OfficAssgnDrugUnits, NumKindsDrugsSeiz, PolicAveOTWorked, LandArea, PopDens, PctUsePubTrans, PolicCars, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy, LemasPctOfficDrugUn, PolicBudgPerPop) |>
cor() |>
round(2)
corrplot::corrplot(cor_matrix, method = "number", tl.cex = 0.5, number.cex = 0.4, type = "lower")
# Keep only the entries of the correlation matrix with absolute values of 0.5 or above
filtered_matrix <- abs(cor_matrix)
filtered_matrix[abs(cor_matrix) < 0.5] <- NA
corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")
As mentioned above, when having the choice, we will try to delete the variables that had a lot of NAs.
To delete: LemasTotReqPerPop, LemasSwornFT, LemasSwFTPerPop, LemasSwFTFieldOps, LemasSwFTFieldPerPop, LemasTotalReq, PctPolicBlack, PctPolicHisp, OfficAssgnDrugUnits, PopDens, PolicBudgPerPop, PolicCars
We are keeping: PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicAsian, PctPolicMinor, NumKindsDrugsSeiz, PolicAveOTWorked, LandArea, PctUsePubTrans, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy, LemasPctOfficDrugUn
It was decided to delete PopDens (population density) and keep PctUsePubTrans because population density appears in similar form in other categories, while PctUsePubTrans (percentage using public transport) is an interesting variable that has not yet been considered.
Unfortunately, we still have to keep variables that had high amounts of NAs such as PolicReqPerOffic, PolicPerPop or PctPolicWhite.
cor_matrix <-
crime2 |>
dplyr::select(PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicAsian, PctPolicMinor, NumKindsDrugsSeiz, PolicAveOTWorked, LandArea, PctUsePubTrans, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy, LemasPctOfficDrugUn) |>
cor() |>
round(2)
filtered_matrix <- abs(cor_matrix)
filtered_matrix[abs(cor_matrix) < 0.5] <- NA
corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")
Adding the chosen variables to the dataset:
new_var <- crime2_m_b |> select(PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicAsian, PctPolicMinor, NumKindsDrugsSeiz, PolicAveOTWorked, LandArea, PctUsePubTrans, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy, LemasPctOfficDrugUn)
crime2_m <- cbind(new_var, crime2_m)
murder:
a1 <- crime2_m |> ggplot(aes(x = murder, y = PolicReqPerOffic, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PolicReqPerOffic")
a2 <- crime2_m |> ggplot(aes(x = murder, y = PolicPerPop, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PolicPerPop")
a3 <- crime2_m |> ggplot(aes(x = murder, y = RacialMatchCommPol, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in RacialMatchCommPol")
a4 <- crime2_m |> ggplot(aes(x = murder, y = PctPolicWhite, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PctPolicWhite")
a5 <- crime2_m |> ggplot(aes(x = murder, y = PctPolicAsian, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PctPolicAsian")
a6 <- crime2_m |> ggplot(aes(x = murder, y = PctPolicMinor, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PctPolicMinor")
a7 <- crime2_m |> ggplot(aes(x = murder, y = NumKindsDrugsSeiz, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in NumKindsDrugsSeiz")
a8 <- crime2_m |> ggplot(aes(x = murder, y = PolicAveOTWorked, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PolicAveOTWorked")
a9 <- crime2_m |> ggplot(aes(x = murder, y = LandArea, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in LandArea")
a10 <- crime2_m |> ggplot(aes(x = murder, y = PctUsePubTrans, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PctUsePubTrans")
a11 <- crime2_m |> ggplot(aes(x = murder, y = PolicOperBudg, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PolicOperBudg")
a12 <- crime2_m |> ggplot(aes(x = murder, y = LemasPctPolicOnPatr, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in LemasPctPolicOnPatr")
a13 <- crime2_m |> ggplot(aes(x = murder, y = LemasGangUnitDeploy, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in LemasGangUnitDeploy")
a14 <- crime2_m |> ggplot(aes(x = murder, y = LemasPctOfficDrugUn, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in LemasPctOfficDrugUn")
multiplot(a1, a2, a3, a4 , cols = 2)
multiplot(a5, a6, a7, a8, a9, cols = 2)
multiplot(a10, a11, a12, a13, a14, cols = 2)
Unfortunately, most box plots turn out to be highly clustered and very similar across the two classes of the target variable. This indicates little predictive power, which can likely be attributed to the high number of NAs found in most of these variables. It underlines once more the importance of having eliminated most of them during the collinearity check.
cor_matrix <-
crime2 |>
dplyr::select(MalePctDivorce, MalePctNevMarr, FemalePctDiv, TotalPctDiv, PersPerFam, PctFam2Par, PctKids2Par, PctYoungKids2Par, PctTeen2Par, PctWorkMomYoungKids, PctWorkMom, NumKidsBornNeverMar, PctKidsBornNeverMar) |>
cor() |>
round(2)
filtered_matrix <- abs(cor_matrix)
filtered_matrix[abs(cor_matrix) < 0.5] <- NA
corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")
To delete: NumKidsBornNeverMar, MalePctDivorce, FemalePctDiv, PctFam2Par, PctYoungKids2Par, PctTeen2Par, PctWorkMomYoungKids, PctKidsBornNeverMar
It was decided to keep PctKids2Par (percentage of kids in family housing with two parents) rather than PctYoungKids2Par (percentage of kids aged 4 and under in two-parent households) or PctTeen2Par (percentage of kids aged 12-17 in two-parent households), since it covers both. The same applies to keeping PctWorkMom (percentage of working mothers) over PctWorkMomYoungKids (percentage of working mothers with young kids).
cor_matrix <-
crime2 |>
dplyr::select(MalePctNevMarr, TotalPctDiv, PersPerFam, PctKids2Par, PctWorkMom) |>
cor() |>
round(2)
filtered_matrix <- abs(cor_matrix)
filtered_matrix[abs(cor_matrix) < 0.5] <- NA
corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")
Adding new variables to dataset:
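The code chunk is again not shown; presumably, following the pattern used for the other variable groups, it looks like this (crime2_m_b is the backup holding all columns):

```r
# Presumed reconstruction: bind the retained family-structure variables
# from the full backup to the working murder data set
new_var <- crime2_m_b |> select(MalePctNevMarr, TotalPctDiv, PersPerFam, PctKids2Par, PctWorkMom)
crime2_m <- cbind(new_var, crime2_m)
```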
murder:
a1 <- crime2_m |> ggplot(aes(x = murder, y = MalePctNevMarr, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in MalePctNevMarr")
a2 <- crime2_m |> ggplot(aes(x = murder, y = TotalPctDiv, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in TotalPctDiv")
a3 <- crime2_m |> ggplot(aes(x = murder, y = PersPerFam, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PersPerFam")
a4 <- crime2_m |> ggplot(aes(x = murder, y = PctKids2Par, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PctKids2Par")
a5 <- crime2_m |> ggplot(aes(x = murder, y = PctWorkMom, fill = factor(murder))) +
geom_boxplot(alpha=0.3) +
coord_flip() + labs(y="", title="Murder in PctWorkMom")
multiplot(a1, a2, a3, a4 ,a5, cols = 2)
As opposed to the police-related variables, here the distribution of the target classes within the independent variables is clearly visible. TotalPctDiv (total percentage of divorced people), for example, tends to take its higher values when murder = 1, which suggests that communities with above-average murder rates tend to have higher divorce rates.
Lastly, we will draw a heat map of the correlations among all remaining variables to see if there is an overlap of information across the different groups.
cor_matrix <-
crime2 |>
dplyr::select(perCapInc, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, pctWWage, MedRentPctHousInc, MedOwnCostPctInc, MedOwnCostPctIncNoMtg, population, racepctblack, racePctWhite, agePct12t29, PctImmigRecent, PctSpeakEnglOnly, PctForeignBorn, PctBornSameState, PctSameHouse85, PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicAsian, PctPolicMinor, NumKindsDrugsSeiz, PolicAveOTWorked, LandArea, PctUsePubTrans, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy, LemasPctOfficDrugUn, MalePctNevMarr, TotalPctDiv, PersPerFam, PctKids2Par, PctWorkMom) |>
cor() |>
round(2)
filtered_matrix <- abs(cor_matrix)
filtered_matrix[abs(cor_matrix) < 0.5] <- NA
corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")
Let's exclude PctKids2Par, population, MalePctNevMarr and MedOwnCostPctInc from the final dataset.
cor_matrix <-
crime2 |>
dplyr::select(perCapInc, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, pctWWage, MedRentPctHousInc, MedOwnCostPctIncNoMtg, racepctblack, racePctWhite, agePct12t29, PctImmigRecent, PctSpeakEnglOnly, PctForeignBorn, PctBornSameState, PctSameHouse85, PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicAsian, PctPolicMinor, NumKindsDrugsSeiz, PolicAveOTWorked, LandArea, PctUsePubTrans, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy, LemasPctOfficDrugUn, TotalPctDiv, PersPerFam, PctWorkMom) |>
cor() |>
round(2)
filtered_matrix <- abs(cor_matrix)
filtered_matrix[abs(cor_matrix) < 0.5] <- NA
corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")
Now we transform the numerical variables, which are on very different scales (excluding the target variable), to a common scale so they can be analyzed and compared more easily.
listconti= c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap", "OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg", "racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent", "PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85", "PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite", "PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked", "LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr", "LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv", "PersPerFam", "PctWorkMom"
)
vardep<-"murder"
means <-apply(crime2_m[,listconti],2,mean,na.rm=TRUE)
sds<-sapply(crime2_m[,listconti],sd,na.rm=TRUE)
crime_m_2<-scale(crime2_m[,listconti], center = means, scale = sds)
crime_m_final <- data.frame(crime_m_2)
new_var <- crime2_m_b |> select(murder)
crime_m_final <- cbind(new_var, crime_m_final)
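A quick sanity check (not in the original code) that the standardization behaved as intended: every scaled column should have mean approximately 0 and standard deviation 1. A toy illustration of the same scale() pattern used above:

```r
# Toy stand-in for crime2_m[, listconti]
m <- data.frame(x = c(2, 4, 6, 8), y = c(10, 20, 30, 40))

# Center with the column means and scale with the column sds, as above
z <- scale(m, center = colMeans(m), scale = sapply(m, sd))

# Every column should now have mean ~0 and sd ~1
stopifnot(all(abs(colMeans(z)) < 1e-9))
stopifnot(all(abs(apply(z, 2, sd) - 1) < 1e-9))
```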
c("murder", "perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap",
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg",
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent",
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85",
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite",
"PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked",
"LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr",
"LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv",
"PersPerFam", "PctWorkMom")
vardep= c("murder")
variable_indepe = c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap", "OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg", "racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent", "PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85", "PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite", "PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked", "LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr", "LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv", "PersPerFam", "PctWorkMom"
)
y<-archivo_m[,vardep]
x<-archivo_m[,variable_indepe]
Before we continue, we will set up parallel computing: we create a cluster that uses all of the device's cores except one.
GS_T0 <- Sys.time()
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
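Once the heavy computations are finished, the cluster should be shut down so the worker processes are released. This teardown does not appear in the original script; a minimal sketch using `stopCluster()` (parallel) and `registerDoSEQ()` (foreach):

```r
library(parallel)
library(doParallel)

cluster <- makeCluster(2)       # small cluster for illustration
registerDoParallel(cluster)
# ... parallel work happens here ...
stopCluster(cluster)            # release the worker processes
foreach::registerDoSEQ()        # fall back to sequential execution
```

After `registerDoSEQ()`, foreach reports a single sequential worker, so later code cannot accidentally depend on a dead cluster.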
We continue with the selection of variables, using different methods:
full<-glm(murder~.,data=archivo_m,family = binomial(link="logit"))
null<-glm(murder~1,data=archivo_m,family = binomial(link="logit"))
selec1<-stepAIC(null,scope=list(upper=full),
direction="both",family = binomial(link="logit"),trace=FALSE)
vec<-(names(selec1[[1]]))
length(vec)
[1] 18
dput(vec)
c("(Intercept)", "racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap",
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn",
"MedRentPctHousInc", "PolicAveOTWorked", "PersPerFam", "PolicPerPop",
"racepctblack", "PctBornSameState", "PctPolicWhite", "indianPerCap",
"PctWorkMom")
AIC_list <- c("racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap",
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn",
"MedRentPctHousInc", "PolicAveOTWorked", "PersPerFam", "PolicPerPop",
"racepctblack", "PctBornSameState", "PctPolicWhite", "indianPerCap",
"PctWorkMom")
For a BIC-style selection we set the penalty with the formula k = log(n): with n = 2908 observations, k = log(2908) ≈ 7.975.
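The penalty can also be derived from the data rather than typed in by hand (a sketch; `archivo_m` is the modelling data frame used in the `glm` calls, assumed here to have 2908 rows):

```r
# BIC-style penalty for stepAIC: k = log(n), with n the number of observations
n <- 2908                 # nrow(archivo_m)
k <- log(n)
round(k, 3)               # 7.975
```

Passing `k = log(nrow(archivo_m))` directly to `stepAIC()` avoids the hard-coded constant.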
full<-glm(murder~.,data=archivo_m,family = binomial(link="logit"))
null<-glm(murder~1,data=archivo_m,family = binomial(link="logit"))
selec1<-stepAIC(null,scope=list(upper=full),
direction="both",family = binomial(link="logit"),trace=FALSE,k=7.975)
vec<-(names(selec1[[1]]))
length(vec)
[1] 9
dput(vec)
c("(Intercept)", "racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap",
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn"
)
BIC_list <- c("racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap",
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn"
)
Boruta performed 16 iterations in 1.130421 mins.
33 attributes confirmed important: agePct12t29, AsianPerCap,
blackPerCap, indianPerCap, LandArea and 28 more;
No attributes deemed unimportant.
summary(out.boruta)
Length Class Mode
finalDecision 33 factor numeric
ImpHistory 576 -none- numeric
pValue 1 -none- numeric
maxRuns 1 -none- numeric
light 1 -none- logical
mcAdj 1 -none- logical
timeTaken 1 difftime numeric
roughfixed 1 -none- logical
call 3 -none- call
impSource 1 -none- character
sal<-data.frame(out.boruta$finalDecision)
sal2<-sal[which(sal$out.boruta.finalDecision=="Confirmed"),,drop=FALSE]
dput(row.names(sal2))
c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap",
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg",
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent",
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85",
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite",
"PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked",
"LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr",
"LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv",
"PersPerFam", "PctWorkMom")
Boruta_list <-
c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap",
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg",
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent",
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85",
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite",
"PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked",
"LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr",
"LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv",
"PersPerFam", "PctWorkMom")
mmpc1 <- MMPC(target = vardep, dataset=crime_m_final, max_k = 3, hash = TRUE,
test = "testIndLogistic")
mmpc1@selectedVars
[1] 3 8 10 11 23 28 31 32
c("blackPerCap", "MedRentPctHousInc", "racepctblack", "racePctWhite",
"PctPolicMinor", "PolicOperBudg", "LemasPctOfficDrugUn", "TotalPctDiv"
)
length(a)
[1] 8
MXM_list <- c("blackPerCap", "MedRentPctHousInc", "racepctblack", "racePctWhite",
"PctPolicMinor", "PolicOperBudg", "LemasPctOfficDrugUn", "TotalPctDiv"
)
SES1 <- SES(vardep, crime_m_final, max_k = 3, hash = TRUE,
test = "testIndLogistic")
SES1@selectedVars
[1] 3 8 10 11 23 28 31 32
c("blackPerCap", "MedRentPctHousInc", "racepctblack", "racePctWhite",
"PctPolicMinor", "PolicOperBudg", "LemasPctOfficDrugUn", "TotalPctDiv"
)
c("blackPerCap", "MedRentPctHousInc", "racepctblack", "racePctWhite",
"PctPolicMinor", "PolicOperBudg", "LemasPctOfficDrugUn", "TotalPctDiv"
)
length(a)
[1] 8
SES_list <- c("blackPerCap", "MedRentPctHousInc", "racepctblack", "racePctWhite",
"PctPolicMinor", "PolicOperBudg", "LemasPctOfficDrugUn", "TotalPctDiv"
)
filtro<-sbf(x,y,sbfControl =
sbfControl(functions = rfSBF,
method = "cv", verbose = FALSE))
a<-dput(filtro$optVariables)
c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap",
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg",
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent",
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85",
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite",
"PctPolicAsian", "PctPolicMinor", "LandArea", "PctUsePubTrans",
"PolicOperBudg", "LemasPctPolicOnPatr", "LemasPctOfficDrugUn",
"TotalPctDiv", "PersPerFam", "PctWorkMom")
length(a)
[1] 30
SBF_list <- c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap",
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg",
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent",
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85",
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite",
"PctPolicAsian", "PctPolicMinor", "LandArea", "PctUsePubTrans",
"PolicOperBudg", "LemasPctPolicOnPatr", "LemasPctOfficDrugUn",
"TotalPctDiv", "PersPerFam", "PctWorkMom")
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
results <- rfe(x, y, sizes=c(1:8), rfeControl=control)
selecrfe<-results$optVariables
length(selecrfe)
[1] 33
dput(selecrfe)
c("racePctWhite", "TotalPctDiv", "racepctblack", "pctWWage",
"PctImmigRecent", "PctWorkMom", "PctBornSameState", "LandArea",
"AsianPerCap", "blackPerCap", "perCapInc", "indianPerCap", "PersPerFam",
"PctSameHouse85", "OtherPerCap", "PctSpeakEnglOnly", "MedRentPctHousInc",
"MedOwnCostPctIncNoMtg", "agePct12t29", "PctUsePubTrans", "PctForeignBorn",
"PctPolicMinor", "PolicAveOTWorked", "PolicPerPop", "LemasPctOfficDrugUn",
"LemasPctPolicOnPatr", "PctPolicWhite", "PolicOperBudg", "PolicReqPerOffic",
"NumKindsDrugsSeiz", "RacialMatchCommPol", "PctPolicAsian", "LemasGangUnitDeploy"
)
RFE_list <-c("racePctWhite", "TotalPctDiv", "racepctblack", "pctWWage",
"PctImmigRecent", "PctBornSameState", "LandArea", "AsianPerCap",
"PctWorkMom", "perCapInc", "blackPerCap", "indianPerCap", "OtherPerCap",
"PctSpeakEnglOnly", "agePct12t29", "PctSameHouse85", "PersPerFam",
"MedRentPctHousInc", "MedOwnCostPctIncNoMtg", "PctUsePubTrans",
"PctForeignBorn", "PctPolicMinor", "PolicAveOTWorked", "LemasPctPolicOnPatr",
"PolicPerPop", "PolicOperBudg", "LemasPctOfficDrugUn", "PctPolicWhite",
"PolicReqPerOffic", "RacialMatchCommPol", "NumKindsDrugsSeiz",
"PctPolicAsian", "LemasGangUnitDeploy")
Let’s compare the algorithms, each with its selected set of variables.
crime_m_final2 <- crime_m_final
crime_m_final2$murder<-ifelse(crime_m_final2$murder==1,"Yes","No")
crime_m_final2$murder<-factor(crime_m_final2$murder)
crime_m_final2 <- na.omit(crime_m_final2 )
crime_m_final2 <-as_tibble(crime_m_final2 )
crime_m_final2 <- as.data.frame(crime_m_final2 )
data2<- crime_m_final2
medias1<-cruzadalogistica(data=data2,
vardep="murder",listconti=AIC_list,
listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias1 <- as.data.frame(medias1[[1]])
medias1$modelo="AIC"
medias2<-cruzadalogistica(data=data2,
vardep="murder",listconti=BIC_list,
listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias2 <- as.data.frame(medias2[[1]])
medias2$modelo="BIC"
medias3<-cruzadalogistica(data=data2,
vardep="murder",listconti=
Boruta_list,
listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias3 <- as.data.frame(medias3[[1]])
medias3$modelo="Boruta"
medias4<-cruzadalogistica(data=data2,
vardep="murder",listconti=RFE_list,
listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias4 <- as.data.frame(medias4[[1]])
medias4$modelo="RFE"
medias5<-cruzadalogistica(data=data2,
vardep="murder",listconti=
SBF_list,
listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias5 <- as.data.frame(medias5[[1]])
medias5$modelo="SBF"
medias6<-cruzadalogistica(data=data2,
vardep="murder",listconti=
MXM_list,
listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias6 <- as.data.frame(medias6[[1]])
medias6$modelo="MXM"
medias7<-cruzadalogistica(data=data2,
vardep="murder",listconti=
SES_list,
listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias7 <- as.data.frame(medias7[[1]])
medias7$modelo="SES"
union1<-rbind(medias1,medias2, medias3,
medias4,medias5,
medias6, medias7)
par(cex.axis=0.7)
boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
It can be seen that the best model is AIC, so we will choose its 17 variables, summarised in the table below.
| Model | Num. Variables | Variables |
|---|---|---|
| AIC | 17 | c(“racePctWhite”, “TotalPctDiv”, “pctWWage”, “blackPerCap”, “PolicOperBudg”, “PctPolicMinor”, “LemasPctOfficDrugUn”, “PctForeignBorn”, “MedRentPctHousInc”,“PolicAveOTWorked”, “PersPerFam”, “PolicPerPop”, “racepctblack”, “PctBornSameState”, “PctPolicWhite”, “indianPerCap”, “PctWorkMom”) |
| Wrapper Boruta | 33 | c(“perCapInc”, “blackPerCap”, “indianPerCap”, “AsianPerCap”, “OtherPerCap”, “pctWWage”, “MedRentPctHousInc”, “MedOwnCostPctIncNoMtg”, “racepctblack”, “racePctWhite”, “agePct12t29”, “PctImmigRecent”, “PctSpeakEnglOnly”, “PctForeignBorn”, “PctBornSameState”, “PctSameHouse85”, “PolicReqPerOffic”, “PolicPerPop”, “RacialMatchCommPol”, “PctPolicWhite”, “PctPolicAsian”, “PctPolicMinor”, “NumKindsDrugsSeiz”, “PolicAveOTWorked”, “LandArea”, “PctUsePubTrans”, “PolicOperBudg”, “LemasPctPolicOnPatr”, “LemasGangUnitDeploy”, “LemasPctOfficDrugUn”, “TotalPctDiv”, “PersPerFam”, “PctWorkMom”) |
| MXM | 8 | c(“blackPerCap”, “MedRentPctHousInc”, “racepctblack”, “racePctWhite”, “PctPolicMinor”, “PolicOperBudg”, “LemasPctOfficDrugUn”, “TotalPctDiv”) |
| Wrapper SES | 8 | c(“blackPerCap”, “MedRentPctHousInc”, “racepctblack”, “racePctWhite”, “PctPolicMinor”, “PolicOperBudg”, “LemasPctOfficDrugUn”, “TotalPctDiv”) |
| SBF | 30 | c(“perCapInc”, “blackPerCap”, “indianPerCap”, “AsianPerCap”, “OtherPerCap”, “pctWWage”, “MedRentPctHousInc”, “MedOwnCostPctIncNoMtg”, “racepctblack”, “racePctWhite”,“agePct12t29”, “PctImmigRecent”, “PctSpeakEnglOnly”, “PctForeignBorn”, “PctBornSameState”, “PctSameHouse85”, “PolicReqPerOffic”, “PolicPerPop”, “RacialMatchCommPol”, “PctPolicWhite”, “PctPolicAsian”, “PctPolicMinor”, “LandArea”, “PctUsePubTrans”, “PolicOperBudg”, “LemasPctPolicOnPatr”, “LemasPctOfficDrugUn”, “TotalPctDiv”,“PersPerFam”, “PctWorkMom”) |
| RFE | 33 | c(“racePctWhite”, “TotalPctDiv”, “racepctblack”, “pctWWage”, “PctImmigRecent”, “PctBornSameState”, “LandArea”, “AsianPerCap”, “PctWorkMom”, “perCapInc”,“blackPerCap”, “indianPerCap”, “OtherPerCap”, “PctSpeakEnglOnly”, “agePct12t29”, “PctSameHouse85”, “PersPerFam”, “MedRentPctHousInc”,“MedOwnCostPctIncNoMtg”, “PctUsePubTrans”,“PctForeignBorn”, “PctPolicMinor”, “PolicAveOTWorked”, “LemasPctPolicOnPatr”, “PolicPerPop”, “PolicOperBudg”, “LemasPctOfficDrugUn”, “PctPolicWhite”,“PolicReqPerOffic”, “RacialMatchCommPol”, “NumKindsDrugsSeiz”, “PctPolicAsian”, “LemasGangUnitDeploy”) |
| BIC | 8 | c(“racePctWhite”, “TotalPctDiv”, “pctWWage”, “blackPerCap”, “PolicOperBudg”, “PctPolicMinor”, “LemasPctOfficDrugUn”, “PctForeignBorn”) |
variables<-c("racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap",
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn",
"MedRentPctHousInc", "PolicAveOTWorked", "PersPerFam", "PolicPerPop",
"racepctblack", "PctBornSameState", "PctPolicWhite", "indianPerCap",
"PctWorkMom")
data2<-data2[,c(variables,vardep)]
Now that we have decided which variables to go with, let’s look for the best parameters of the network, first without and then with early stopping. First we estimate a reasonable number of hidden nodes with the formula: number of parameters = h(k+1) + h + 1. With 2908 observations, 17 variables and a target of roughly 30 observations per parameter, we can afford about 2908/30 ≈ 97 parameters.
With 5 hidden nodes we need at least 2880 observations: 5(17+1) + 5 + 1 = 96 parameters, and 96 × 30 = 2880.
With 7 nodes, the 134 parameters would require 4020 observations.
Hopefully we will find an adequate number of nodes to avoid over-fitting (too many nodes) or under-fitting (too few). In the next section we tune the network with repeated cross-validation using avNNet.
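The arithmetic above can be checked with a small helper (a sketch; h is the number of hidden nodes and k the number of input variables, following the formula stated above):

```r
# Number of weights in a single-hidden-layer network:
# h*(k+1) input-to-hidden weights (incl. hidden biases)
# + h hidden-to-output weights + 1 output bias
n_params <- function(h, k) h * (k + 1) + h + 1

n_params(5, 17)        # 96 parameters -> 96 * 30 = 2880 observations
n_params(7, 17)        # 134 parameters -> 134 * 30 = 4020 observations
```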
control<-trainControl(method = "repeatedcv",
number=4,savePredictions = "all"
,classProbs = TRUE, repeats=4,
summaryFunction=twoClassSummary
)
set.seed(123)
nnetgrid <- expand.grid(size=c(5, 6,8, 10),decay=c(0.1, 0.01, 0.001),bag=F)
completo<-data.frame()
listaiter<-c(1, 10, 20, 50, 70)
for (iter in listaiter)
{
rednnet<- train(murder~.,
data=data2,
method="avNNet",linout = FALSE,maxit=iter,
trControl=control,repeats=5,tuneGrid=nnetgrid,trace=F, metric='ROC')
rednnet$results$itera<-iter
completo<-rbind(completo,rednnet$results)
}
completo<-completo[order(completo$ROC),]
ggplot(completo, aes(x=factor(itera), y=ROC,
color=factor(decay),pch=factor(size))) +
geom_point(position=position_dodge(width=0.5),size=3)
We want the ROC as high as possible with as few iterations as possible. Between 50 and 70 iterations the curve flattens, so let’s concentrate on this range. We will also increase the size to see whether we can get better results.
control<-trainControl(method = "repeatedcv",
number=4,savePredictions = "all"
,classProbs = TRUE, repeats=4,
summaryFunction=twoClassSummary
)
set.seed(123)
nnetgrid <- expand.grid(size=c(8, 10, 15, 20),decay=c(0.1, 0.01, 0.001),bag=F)
completo<-data.frame()
listaiter<-c( 60, 70, 80)
for (iter in listaiter)
{
rednnet<- train(murder~.,
data=data2,
method="avNNet",linout = FALSE,maxit=iter,
trControl=control,repeats=5,tuneGrid=nnetgrid,trace=F, metric='ROC')
rednnet$results$itera<-iter
completo<-rbind(completo,rednnet$results)
}
completo<-completo[order(completo$ROC),]
ggplot(completo, aes(x=factor(itera), y=ROC,
color=factor(decay),pch=factor(size))) +
geom_point(position=position_dodge(width=0.5),size=3)
Performance is still increasing, so we will try a bigger size and more iterations.
control<-trainControl(method = "repeatedcv",
number=4,savePredictions = "all"
,classProbs = TRUE, repeats=4,
summaryFunction=twoClassSummary
)
set.seed(13)
nnetgrid <- expand.grid(size=c(15, 20, 25, 30),decay=c(0.1, 0.01, 0.001),bag=F)
completo<-data.frame()
listaiter<-c( 80, 90, 100)
for (iter in listaiter)
{
rednnet<- train(murder~.,
data=data2,
method="avNNet",linout = FALSE,maxit=iter,
trControl=control,repeats=5,tuneGrid=nnetgrid,trace=F, metric='ROC')
rednnet$results$itera<-iter
completo<-rbind(completo,rednnet$results)
}
completo<-completo[order(completo$ROC),]
ggplot(completo, aes(x=factor(itera), y=ROC,
color=factor(decay),pch=factor(size))) +
geom_point(position=position_dodge(width=0.5),size=3)
The best number of iterations is 101, the best decay 0.01 and the best number of nodes 30. We leave the network tuned like that and will apply early stopping so that it does not overfit. But before we compare it to the previous models, let’s check whether the same parameters with maxit = 100 give a better performance.
set.seed(12)
nnetgrid <- expand.grid(size=c(30),decay=c(0.01),bag=F)
completo<-data.frame()
listaiter<-c(101)
for (iter in listaiter)
{
rednnet<- train(murder~.,
data=data2,
method="avNNet",linout = FALSE,maxit=100,
trControl=control,repeats=5,tuneGrid=nnetgrid,trace=F)
rednnet$results$itera<-iter
completo<-rbind(completo,rednnet$results)
}
completo<-completo[order(completo$ROC),]
ggplot(completo, aes(x=factor(itera), y=ROC,
color=factor(decay),pch=factor(size))) +
geom_point(position=position_dodge(width=0.5),size=3)
The ROC is not higher with maxit = 100, so we will continue with the network without that setting.
data2<-crime_m_final2[,c(AIC_list, "murder")]
medias8<-cruzadaavnnetbin(data=data2, vardep="murder",
listconti=AIC_list, listclass=c(""),
grupos=4, sinicio=1234, repe=25,
repeticiones=5, itera=101, size=c(30),
decay=c(0.01))
size decay bag Accuracy Kappa AccuracySD KappaSD
1 30 0.01 FALSE 0.8616215 0.7232425 0.01392548 0.02785255
medias8 <- as.data.frame(medias8[[1]])
medias8$modelo="Network"
union1<-rbind(medias1,medias2, medias3, medias4, medias5, medias6, medias7, medias8)
par(cex.axis=0.8)
boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
Thanks to the tuning, the neural network is now by far the best model.
We will continue with the Bagging algorithm and find the optimal sampsize and ntree by tuning with repeated cross-validation in caret. We set mtry equal to the number of independent variables obtained from the AIC model. In addition to Accuracy, we will also compare the means of Kappa. Kappa is important because it measures the agreement between predictions and observed classes corrected for the agreement expected by chance; a high Kappa protects against conclusions driven by random agreement. Therefore, it would be best to have a Kappa above 0.6.
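To make the chance correction behind Kappa concrete, it can be computed by hand from a confusion matrix (a sketch with made-up counts, not results from this analysis):

```r
# Hypothetical 2x2 confusion matrix: rows = predicted, columns = observed
cm <- matrix(c(80, 20, 10, 90), nrow = 2)
n  <- sum(cm)

po <- sum(diag(cm)) / n                        # observed agreement (accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2     # agreement expected by chance
kappa <- (po - pe) / (1 - pe)
kappa                                          # 0.7 here: substantial agreement
```

With these counts accuracy is 0.85, but half of that agreement would be expected by chance, which is exactly what Kappa discounts.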
It is important to note that sampsize is the parameter that allows us to achieve class balance. As seen above, the target variable murder is perfectly balanced thanks to oversampling; nevertheless, we will tune sampsize to further improve our model. To determine an appropriate value, we will perform repeated cross-validation with values in the range of 10 to 200, since the minimum sampsize should not be too high, to avoid over-fitting.
set.seed(23)
nmin <- min(table(data2$murder))
mtry <- c(length(AIC_list))
control<-trainControl(method = "repeatedcv", number=4,
allowParallel = TRUE,
repeats = 5, savePredictions = "all",search = 'grid', classProbs=TRUE
#, summaryFunction=twoClassSummary
)
rfgrid <- expand.grid(.mtry = mtry)
modellist <- list()
sizes <- c(10, 50, 100, 200)
for (sampsize in sizes){
set.seed(234)
rf<- train(factor(murder)~.,data=data2,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=500, nodesize=20,
metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 10, 50, 100, 200
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
10 0.7637363 0.7812930 0.7866482 0.7870673 0.7949863 0.8076923 0
50 0.7744154 0.7958068 0.7991743 0.8010312 0.8077717 0.8214286 0
100 0.7689133 0.8002063 0.8027495 0.8036451 0.8088843 0.8241758 0
200 0.7867950 0.8069516 0.8136176 0.8133429 0.8195895 0.8337912 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
10 0.5274725 0.5625760 0.5733081 0.5741383 0.5899725 0.6153846 0
50 0.5488402 0.5916545 0.5983486 0.6020640 0.6155451 0.6428571 0
100 0.5378328 0.6004251 0.6054815 0.6072933 0.6177686 0.6483516 0
200 0.5736006 0.6139031 0.6272198 0.6266895 0.6391790 0.6675824 0
We can see that the best sampsize is 200, since Accuracy and Kappa are the highest: Kappa’s mean is above 0.6 and Accuracy just surpasses 0.8. Both might still increase with an even bigger sampsize, so we will tune again with values above 200.
set.seed(234)
#message("Numero mínimo en sampsize es ",(nmin*0.75))
nmin <- min(table(data2$murder))
mtry <- c(length(AIC_list))
control<-trainControl(method = "repeatedcv", number=4,
allowParallel = TRUE,
repeats = 5, savePredictions = "all",search = 'grid', classProbs=TRUE
#, summaryFunction=twoClassSummary
)
rfgrid <- expand.grid(.mtry = mtry)
modellist <- list()
sizes <- c( 200, 250, 300, 350, 400)
for (sampsize in sizes){
set.seed(234)
rf<- train(factor(murder)~.,data=data2,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=500, nodesize=20,
metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 200, 250, 300, 350, 400
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
200 0.7867950 0.8069516 0.8136176 0.8133429 0.8195895 0.8337912 0
250 0.7936726 0.8134258 0.8184316 0.8168490 0.8202754 0.8337912 0
300 0.7977992 0.8174640 0.8239355 0.8215279 0.8267906 0.8324176 0
350 0.8005502 0.8197454 0.8254295 0.8252409 0.8333906 0.8420330 0
400 0.8060523 0.8250261 0.8321871 0.8297127 0.8347107 0.8415978 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
200 0.5736006 0.6139031 0.6272198 0.6266895 0.6391790 0.6675824 0
250 0.5873570 0.6268389 0.6368631 0.6337012 0.6405507 0.6675824 0
300 0.5956083 0.6349276 0.6478711 0.6430595 0.6535813 0.6648352 0
350 0.6011102 0.6394954 0.6508844 0.6504850 0.6667789 0.6840659 0
400 0.6121170 0.6500269 0.6644003 0.6594296 0.6694215 0.6831956 0
Highest Kappa and Accuracy seem to be at sampsize 400, which means we can tune one more round to see whether they keep increasing with a higher sampsize.
set.seed(234)
#message("Numero mínimo en sampsize es ",(nmin*0.75))
nmin <- min(table(data2$murder))
mtry <- c(length(AIC_list))
control<-trainControl(method = "repeatedcv", number=4,
allowParallel = TRUE,
repeats = 5, savePredictions = "all",search = 'grid', classProbs=TRUE
#, summaryFunction=twoClassSummary
)
rfgrid <- expand.grid(.mtry = mtry)
modellist <- list()
sizes <- c( 400, 450, 500, 550, 600)
for (sampsize in sizes){
set.seed(234)
rf<- train(factor(murder)~.,data=data2,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=500, nodesize=20,
metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 400, 450, 500, 550, 600
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
400 0.8060523 0.8250261 0.8321871 0.8297127 0.8347107 0.8415978 0
450 0.8115543 0.8295530 0.8334480 0.8328750 0.8393536 0.8489011 0
500 0.8184319 0.8296118 0.8363152 0.8365870 0.8444061 0.8530220 0
550 0.8143054 0.8371686 0.8430832 0.8415407 0.8464187 0.8571429 0
600 0.8129298 0.8384331 0.8430832 0.8436055 0.8512397 0.8624484 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
400 0.6121170 0.6500269 0.6644003 0.6594296 0.6694215 0.6831956 0
450 0.6231179 0.6591060 0.6669164 0.6657546 0.6787149 0.6978022 0
500 0.6368769 0.6592279 0.6726304 0.6731782 0.6888142 0.7060440 0
550 0.6286227 0.6743470 0.6861590 0.6830857 0.6928375 0.7142857 0
600 0.6258647 0.6768941 0.6861566 0.6872148 0.7024793 0.7249067 0
The best sampsize is 600, with an Accuracy of 0.84 and a Kappa of 0.69.
We will use sampsize 600 to determine the number of trees to be included in the bagging.
ntrees <- c(50, 100, 200, 300)
modellist <- list()
for (ntree in ntrees){
set.seed(12345)
rf<- train(factor(murder)~.,data=data2,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=ntree,
nodesize=20, metric='ROC', sampsize=c(600, 600))
key <- toString(ntree)
modellist[[key]] <- rf
}
#Compare results
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 50, 100, 200, 300
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
50 0.8090659 0.8335626 0.8375783 0.8377621 0.8433542 0.8569464 0
100 0.8090659 0.8363136 0.8452545 0.8417500 0.8487999 0.8571429 0
200 0.8131868 0.8361384 0.8446744 0.8424377 0.8508953 0.8598901 0
300 0.8173077 0.8376891 0.8432998 0.8422312 0.8512908 0.8557692 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
50 0.6181319 0.6671182 0.6751780 0.6755258 0.6867119 0.7139073 0
100 0.6181319 0.6726551 0.6905148 0.6835015 0.6975997 0.7142857 0
200 0.6263736 0.6722766 0.6893485 0.6848757 0.7017906 0.7197802 0
300 0.6346154 0.6753838 0.6865953 0.6844631 0.7025838 0.7115385 0
We can see the Accuracy is above 0.8, which is good, and the Kappa is even close to 0.7. However, the trend is still upward, so values of ntree above 300 will be tested.
data3 <- data2
ntrees <- c(300, 400, 500)
set.seed(125)
modellist <- list()
for (ntree in ntrees){
set.seed(1225)
rf<- train(factor(murder)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=ntree,
nodesize=20, metric='ROC', sampsize=c(600, 600))
key <- toString(ntree)
modellist[[key]] <- rf
}
#Compare results
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 300, 400, 500
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
300 0.8250689 0.8367147 0.8445667 0.8443571 0.8521320 0.8640110 0
400 0.8236915 0.8329329 0.8486933 0.8446326 0.8543886 0.8626374 0
500 0.8236915 0.8337341 0.8480055 0.8454571 0.8554216 0.8640110 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
300 0.6501377 0.6734288 0.6891433 0.6887145 0.7042657 0.7280220 0
400 0.6473829 0.6658626 0.6973882 0.6892642 0.7087751 0.7252747 0
500 0.6473829 0.6674683 0.6960168 0.6909142 0.7108423 0.7280220 0
We are going to choose ntrees = 500, since it has the highest Accuracy and Kappa.
We will start with the selection of mtry, which refers to the number of variables randomly selected as candidates at each split when constructing each tree in the ensemble. Choosing the right value of mtry is important because it affects the randomness and diversity of the random forest’s trees. If mtry is too low, each tree has a limited view of the available features, which may lead to under-fitting. If mtry is too high, the trees become more similar to one another, increasing the risk of over-fitting. We will use the ntree found with Bagging (500).
set.seed(12345)
rfgrid<-expand.grid(mtry=c(2, 3, 5, 7, 9, 10, 20))
rf<- train(factor(murder)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,ntree=500,nodesize=20,replace=TRUE,
importance=T)
rf
Random Forest
2908 samples
17 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (4 fold, repeated 5 times)
Summary of sample sizes: 2180, 2182, 2181, 2181, 2181, 2181, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8426452 0.6852884
3 0.8519298 0.7038576
5 0.8567442 0.7134864
7 0.8588074 0.7176132
9 0.8599077 0.7198143
10 0.8597002 0.7193990
20 0.8588067 0.7176132
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was mtry = 9.
For mtry = 9, Accuracy is well above 0.8 and Kappa even surpasses 0.7. These are already good figures; with a line chart we can visualize whether performance continues to increase with a growing mtry.
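The line chart can be drawn straight from the caret object (`plot(rf)` does this automatically). As a check, a sketch reproducing the accuracies printed above in base R; the maximum matches caret’s selection of mtry = 9:

```r
# Accuracy per mtry, values copied from the resampling output above
acc <- data.frame(
  mtry     = c(2, 3, 5, 7, 9, 10, 20),
  Accuracy = c(0.8426452, 0.8519298, 0.8567442, 0.8588074,
               0.8599077, 0.8597002, 0.8588067)
)
plot(acc$mtry, acc$Accuracy, type = "b",
     xlab = "mtry", ylab = "CV Accuracy")
acc$mtry[which.max(acc$Accuracy)]   # 9
```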
The line flattens out around mtry = 12, and we will choose this value, since a smaller mtry is more advantageous. mtry sets how many variables are considered at each node split; a low mtry benefits diversity because different variables are more likely to be considered at each node, ensuring the trees in the forest differ from one another. This increases the randomness of the ensemble and, ultimately, its ability to generalize.
Now we will find the appropriate sampsize for the Random Forest, using the mtry (12) and ntree (500) obtained.
set.seed(12345)
rfgrid<-expand.grid(.mtry=12)
modellist <- list()
sizes <- c(10, 50, 100, 200)
for (sampsize in sizes){
set.seed(12345)
rf<- train(factor(murder)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=500,
nodesize=20,metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
#Compare results
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 10, 50, 100, 200
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
10 0.7637363 0.7810247 0.7840440 0.7867299 0.7924264 0.8060523 0
50 0.7706044 0.7952593 0.7990364 0.7995892 0.8033747 0.8186813 0
100 0.7867950 0.7988308 0.8068713 0.8066740 0.8154270 0.8236915 0
200 0.7870879 0.8084594 0.8121133 0.8131405 0.8204341 0.8296703 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
10 0.5274725 0.5620152 0.5680872 0.5734565 0.5848464 0.6121317 0
50 0.5412088 0.5905281 0.5980648 0.5991799 0.6067493 0.6373626 0
100 0.5736006 0.5976468 0.6137556 0.6133475 0.6308540 0.6473829 0
200 0.5741758 0.6169213 0.6242340 0.6262811 0.6408503 0.6593407 0
Accuracy is already slightly above 0.8 and Kappa above the threshold of 0.6. But it seems like their values can still rise with a higher sampsize.
set.seed(12345)
rfgrid<-expand.grid(.mtry=12)
modellist <- list()
sizes <- c(300, 400, 500, 600)
for (sampsize in sizes){
set.seed(12345)
rf<- train(factor(murder)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=500,
nodesize=20,metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
#Compare results
results <- resamples(modellist)
summary(results)
Call:
summary.resamples(object = results)
Models: 300, 400, 500, 600
Number of resamples: 20
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
300 0.7939560 0.8112756 0.8198074 0.8194013 0.8288064 0.8418157 0
400 0.7994505 0.8219326 0.8280605 0.8280655 0.8340220 0.8502747 0
500 0.8118132 0.8282377 0.8349381 0.8350105 0.8426839 0.8502747 0
600 0.8090659 0.8346514 0.8412375 0.8414771 0.8503615 0.8622590 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
300 0.5879121 0.6225347 0.6396022 0.6388017 0.6576269 0.6836511 0
400 0.5989011 0.6438757 0.6561360 0.6561301 0.6680441 0.7005495 0
500 0.6236264 0.6564768 0.6698743 0.6700209 0.6853647 0.7005495 0
600 0.6181319 0.6692929 0.6824729 0.6829540 0.7007183 0.7245179 0
Accuracy and Kappa seem to be the highest at sampsize 600.
Now we will compare bagging and random forest with our previous models.
medias9<-cruzadarfbin(data=data2,
vardep="murder",listconti= AIC_list,
listclass=c(""),grupos=4,sinicio=12345,repe=25,
nodesize=20,mtry=length(AIC_list),ntree=500,replace=TRUE,sampsize=600)
mtry Accuracy Kappa AccuracySD KappaSD
1 17 0.8212106 0.6424197 0.01410735 0.02821521
medias9 <- as.data.frame(medias9[[1]])
medias9$modelo <- "Bagging"
medias10<-cruzadarfbin(data=data2,
vardep="murder",listconti= AIC_list,
listclass=c(""),grupos=4,sinicio=12345,repe=25,
nodesize=20,mtry=12,ntree=500,replace=TRUE,sampsize=600)
mtry Accuracy Kappa AccuracySD KappaSD
1 12 0.8206882 0.6413748 0.01313456 0.02627187
medias10 <- as.data.frame(medias10[[1]])
medias10$modelo <- "RF"
union1<-rbind(medias1,medias2, medias3, medias4, medias5, medias6, medias7, medias8, medias9, medias10)
par(cex.axis=0.8)
boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
As can be seen, they are among the best performing models; however, the neural network is still considerably better.
We will continue with the Gradient Boosting model. This is a method that combines multiple weak learners (decision trees) to create a robust prediction model. The final model prediction is obtained by aggregating the predictions of all the individual models in the ensemble.
To develop the best possible ensemble, we will take into account Accuracy and Kappa and tune four parameters:
GS_T0 <- Sys.time()
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
control<-trainControl(method = "repeatedcv",number=4,repeats=3,
savePredictions = "all",classProbs=TRUE)
set.seed(23)
gbmgrid <-expand.grid(n.minobsinnode=c(5,10,15),
shrinkage=c(0.1,0.01),n.trees=c(50, 100,200, 300),
interaction.depth=c(2, 4, 6, 8))
gbm<- train(murder~.,data=data3,
method="gbm",trControl=control,
tuneGrid=gbmgrid,distribution="bernoulli",verbose=FALSE)
plot(gbm)
gbm$bestTune
n.trees interaction.depth shrinkage n.minobsinnode
92 300 8 0.1 10
sorted_results <- gbm$results[order(-gbm$results$Accuracy), ]
print(sorted_results[1, c("Accuracy", "Kappa")])
Accuracy Kappa
92 0.8836566 0.7673131
We can see that the best model has a Kappa of 0.77 and an Accuracy of 0.88. The parameters of this model (number 92) are ntrees=300, interaction.depth=8, shrinkage=0.1 and n.minobsinnode = 10.
Let's use these parameters to see whether tuning bag.fraction (the fraction of observations drawn to fit each tree) improves the model. For this parameter we will try the values 0.8 and 0.5.
set.seed(3)
gbmgrid <-expand.grid(n.minobsinnode=c(10),
shrinkage=c(0.1),n.trees=c(300),
interaction.depth=c(8))
set.seed(3)
gbm<- train(factor(murder)~.,data=data3,tuneGrid=gbmgrid,
method="gbm",trControl=control,distribution="bernoulli",
bag.fraction=1,verbose=FALSE)
gbm1<- train(factor(murder)~.,data=data3,tuneGrid=gbmgrid, method="gbm",trControl=control,distribution="bernoulli", bag.fraction=0.8,verbose=FALSE)
gbm2<- train(factor(murder)~.,data=data3,tuneGrid=gbmgrid, method="gbm",trControl=control,distribution="bernoulli", bag.fraction=0.5,verbose=FALSE)
sorted_results1 <- gbm1$results[order(-gbm1$results$Accuracy), ]
sorted_results <- gbm2$results[order(-gbm2$results$Accuracy), ]
combined_results <- rbind(sorted_results, sorted_results1)
combined_results
n.minobsinnode shrinkage n.trees interaction.depth Accuracy
1 10 0.1 300 8 0.8747110
2 10 0.1 300 8 0.8744859
Kappa AccuracySD KappaSD
1 0.7494199 0.01612700 0.03225362
2 0.7489688 0.01239853 0.02480232
We see that a bag.fraction of 0.8 or 0.5 does not bring much benefit in terms of Accuracy and Kappa, so we will continue without this parameter.
medias11<-cruzadagbmbin(data=data3,
vardep="murder",listconti=AIC_list,
listclass=c(""),grupos=4,sinicio=12345,repe=25,
n.minobsinnode=c(10),shrinkage=c(0.1),n.trees=c(300),
interaction.depth=c(8))
n.minobsinnode shrinkage n.trees interaction.depth Accuracy
1 10 0.1 300 8 0.8781428
Kappa AccuracySD KappaSD
1 0.756285 0.01109668 0.02219527
medias11 <- as.data.frame(medias11[[1]])
medias11$modelo <- "GBM"
union1<-rbind(medias1,medias2, medias3, medias4, medias5, medias6, medias7, medias8, medias9, medias10, medias11)
par(cex.axis=0.8)
boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
GBM proves to be the best model now.
We continue with the Support Vector Machine. The goal of this model is to find the boundary that best separates the data into classes, so we are going to use three types of kernel — linear, polynomial and RBF (radial basis function) — and tune the parameters C, degree and sigma.
For the linear SVM only a suitable C needs to be found. This parameter controls the balance between bias and overfitting.
set.seed(1234)
control<-trainControl(method = "repeatedcv",number=4,repeats=5,
savePredictions = "all",classProbs=TRUE)
# Apply caret and build the model
SVMgrid <-expand.grid(C=c(0.5,1,2,5,10,30))
SVMlin<- train(murder~.,data=data3,
method="svmLinear",trControl=control,
tuneGrid=SVMgrid,replace=replace)
print(SVMlin$results)
C Accuracy Kappa AccuracySD KappaSD
1 0.5 0.8025430 0.6050879 0.01264452 0.02529202
2 1.0 0.8018555 0.6037118 0.01303748 0.02607800
3 2.0 0.8024745 0.6049512 0.01276444 0.02553110
4 5.0 0.8017176 0.6034362 0.01234997 0.02470681
5 10.0 0.8026119 0.6052261 0.01320621 0.02641420
6 30.0 0.8021310 0.6042636 0.01241081 0.02482416
We can see that all values of C surpass an Accuracy of 0.8, but Kappa is only slightly above 0.6. C=10 seems to have the highest Accuracy and Kappa. Let's try some values around 10 and visualize the results.
set.seed(13)
SVMgrid<-expand.grid(C=c(7, 9, 10,12, 15))
control<-trainControl(method = "repeatedcv",number=4, allowParallel = TRUE,
repeats = 5, savePredictions = "all",search = 'grid')
SVMlin<- train(murder~.,data=data3,
method="svmLinear",trControl=control,
tuneGrid=SVMgrid,verbose=FALSE)
completo<-data.frame()
completo<-rbind(completo,SVMlin$results)
completo<-completo[order(completo$Accuracy),]
ggplot(completo, aes(x=factor(C), y=Accuracy)) +
geom_point(position=position_dodge(width=0.5),size=3)
completo<-data.frame()
completo<-rbind(completo,SVMlin$results)
completo<-completo[order(completo$Kappa),]
ggplot(completo, aes(x=factor(C), y=Kappa)) +
geom_point(position=position_dodge(width=0.5),size=3)
We can see that the best C for the linear SVM is C=7.
Apart from C, we are also going to find the best sigma, which governs the width of the RBF kernel. A wider kernel means that more distant points are considered similar; the opposite holds for a narrow kernel.
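To make sigma concrete: caret's svmRadial relies on kernlab, whose RBF kernel is k(x, y) = exp(−sigma·‖x−y‖²), so in this parameterisation a larger sigma actually means a narrower kernel. A small sketch (in Python for illustration, with toy points):

```python
import math

def rbf(x, y, sigma):
    """kernlab-style RBF kernel: exp(-sigma * squared Euclidean distance)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * d2)

x, y = (0.0, 0.0), (2.0, 0.0)    # two points at distance 2
for sigma in (0.05, 0.7, 1.5):   # values from the grids used here
    print(sigma, round(rbf(x, y, sigma), 4))
# small sigma (wide kernel): distant points stay similar
# large sigma (narrow kernel): their similarity collapses towards 0
```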
set.seed(123456)
SVMgrid<-expand.grid(C=c(1, 5,10, 15, 20),
sigma=c(0.0001,0.005,0.01,0.05,0.7))
control<-trainControl(method = "cv",
number=4,savePredictions = "all")
SVMRBF<- train(murder~.,data=data3,
method="svmRadial",trControl=control,
tuneGrid=SVMgrid,verbose=FALSE)
sorted_results <- SVMRBF$results[order(-SVMRBF$results$Accuracy), ]
print(sorted_results[1:3, c("Accuracy", "Kappa", "C", "sigma")])
Accuracy Kappa C sigma
10 0.8720826 0.7441602 5 0.7
15 0.8703656 0.7407260 10 0.7
25 0.8693306 0.7386570 20 0.7
All top 3 results have a Kappa higher than 0.7 and an Accuracy close to 0.9. Let's see whether there are better values of C and sigma near these three models' parameters, and visualize the results.
set.seed(1234)
SVMgrid<-expand.grid(C=c(3, 5, 7, 10, 12, 15, 20, 25),
sigma=c(0.7, 0.9, 1, 1.5, 2, 2.5))
control<-trainControl(method = "repeatedcv",
number=4,savePredictions = "all")
SVMRBF <- train(murder~.,data=data3,
method="svmRadial",trControl=control,
tuneGrid=SVMgrid,verbose=FALSE)
sorted_results <- SVMRBF$results[order(-SVMRBF$results$Accuracy), ]
print(sorted_results[1:3, c("Accuracy", "Kappa", "C", "sigma")])
Accuracy Kappa C sigma
10 0.8810088 0.7620187 5 1.5
16 0.8806654 0.7613319 7 1.5
4 0.8806649 0.7613313 3 1.5
completo<-data.frame()
completo<-rbind(completo,SVMRBF$results)
completo<-completo[order(completo$Kappa),]
ggplot(completo, aes(x=factor(C), y=Kappa,
color=factor(sigma))) +
geom_point(position=position_dodge(width=0.5),size=3)
completo<-rbind(completo,SVMRBF$results)
completo<-completo[order(completo$Accuracy),]
ggplot(completo, aes(x=factor(C), y=Accuracy,
color=factor(sigma))) +
geom_point(position=position_dodge(width=0.5),size=3)
The best values for both Accuracy and Kappa are sigma=1.5 and C=5.
For the polynomial kernel we will tune degree and scale in addition to C. Degree controls the flexibility and shape of the curve separating the classes; a degree that is too high can lead to overfitting, as the boundary becomes more complex.
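For reference, caret's svmPoly uses kernlab's polynomial kernel k(x, y) = (scale·⟨x, y⟩ + offset)^degree, with offset 1 by default. A toy illustration (in Python) of how degree inflates the similarity scores, and with them the flexibility of the boundary:

```python
def poly_kernel(x, y, degree, scale=1.0, offset=1.0):
    """kernlab-style polynomial kernel: (scale * <x, y> + offset) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (scale * dot + offset) ** degree

x, y = (1.0, 0.5), (0.5, 1.0)   # toy observations with <x, y> = 1
for degree in (2, 4, 6):        # degrees from the grids used here
    print(degree, poly_kernel(x, y, degree))
# 2 -> 4.0, 4 -> 16.0, 6 -> 64.0: higher degrees amplify differences in
# similarity, giving a more flexible but more overfitting-prone boundary
```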
set.seed(1235)
control<-trainControl(method = "repeatedcv",number=4, allowParallel = TRUE,
savePredictions = "all",search = 'grid')
SVMgrid <-expand.grid(C=c(0.1, 0.3),degree=c(2,3,4),scale=c(1, 2, 3))
SVM_poli<- train(murder~.,data=data3,
method="svmPoly",trControl=control,
tuneGrid=SVMgrid,replace=replace)
sorted_results <- SVM_poli$results[order(-SVM_poli$results$Accuracy), ]
print(sorted_results[1:3, c("Accuracy", "Kappa", "C", "degree", "scale")])
Accuracy Kappa C degree scale
16 0.8359710 0.6719487 0.3 4 1
7 0.8342516 0.6685107 0.1 4 1
9 0.8253117 0.6506336 0.1 4 3
We can see that model 16 seems to be the best. Let's see if we can find a higher Accuracy and Kappa with similar parameters.
set.seed(1234)
control<-trainControl(method = "repeatedcv",number=4, allowParallel = TRUE,
savePredictions = "all",search = 'grid')
SVMgrid <-expand.grid(C=c( 0.3, 0.5, 0.8, 1),degree=c(4, 6, 8),scale=c(1, 1.5, 2))
SVM_poli<- train(murder~.,data=data3,
method="svmPoly",trControl=control,
tuneGrid=SVMgrid,replace=replace)
sorted_results <- SVM_poli$results[order(-SVM_poli$results$Accuracy), ]
# Print the top 3 rows with highest accuracy and kappa values
print(sorted_results[1:3, c("Accuracy", "Kappa", "C", "degree", "scale")])
Accuracy Kappa C degree scale
4 0.8315006 0.6630115 0.3 6 1
13 0.8315006 0.6630115 0.5 6 1
22 0.8315006 0.6630115 0.8 6 1
set.seed(1235)
completo<-data.frame()
completo<-rbind(completo,SVM_poli$results)
completo<-completo[order(completo$Accuracy),]
plot1 <- ggplot(completo, aes(x=factor(C), y=Accuracy,
color=factor(scale),pch=factor(degree))) +
geom_point(position=position_dodge(width=0.5),size=3)
plot1
completo<-data.frame()
completo<-rbind(completo,SVM_poli$results)
completo<-completo[order(completo$Kappa),]
plot1 <- ggplot(completo, aes(x=factor(C), y=Kappa,
color=factor(scale),pch=factor(degree))) +
geom_point(position=position_dodge(width=0.5),size=3)
plot1
For the polynomial SVM we choose the parameters C=0.28, degree=6 and scale=1.
medias12<-cruzadaSVMbin(data=data3,
vardep="murder",listconti= AIC_list,
listclass=c(""),grupos=4,sinicio=12345,repe=25,C=7)
C Accuracy Kappa AccuracySD KappaSD
1 7 0.8007431 0.6014852 0.01245616 0.02491299
medias12 <- as.data.frame(medias12[[1]])
medias12$modelo="SVM_lineal"
medias13<-cruzadaSVMbinRBF(data=data3,
vardep="murder",listconti= AIC_list,
listclass=c(""),grupos=4,sinicio=12345,repe=25,
C=5,sigma=1.5)
C sigma Accuracy Kappa AccuracySD KappaSD
1 5 1.5 0.8778536 0.7557054 0.01327893 0.02656247
medias13 <- as.data.frame(medias13[[1]])
medias13$modelo="SVM_RBF"
medias14<- cruzadaSVMbinPoly(data=data3,
vardep="murder",listconti= AIC_list,
listclass=c(""),grupos=4,sinicio=12345,repe=25,
C=0.28,degree=6,scale=1)
line search fails 5.837733e-07 -0.0004815689 -3.641992e-05 -9.81476e-12 2.724726e-17 5.016644e-15 -9.923923e-22
C degree scale Accuracy Kappa AccuracySD KappaSD
1 0.28 6 1 0.5658596 0.1320145 0.05830089 0.116338
medias14 <- as.data.frame(medias14[[1]])
medias14$modelo="SVM_Poli"
union1<-rbind(medias1, medias2, medias3, medias4, medias5, medias6, medias7, medias8, medias9, medias10, medias11, medias12, medias13, medias14)
par(cex.axis=0.4)
boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
We can see that SVM Polynomial is the worst model by far for both AUC and Error Rate. The variance of its values is very high in both cases. In contrast, SVM RBF is among the best ones.
As one of the last steps, we are going to build an ensemble that combines our models. We have already tuned a variety of models, so we will use the best 5 algorithms (Neural Network, Bagging, GBM, SVM RBF and RF).
vardep<-"murder"
listconti<-c("racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap",
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn",
"MedRentPctHousInc", "PolicAveOTWorked", "PersPerFam", "PolicPerPop",
"racepctblack", "PctBornSameState", "PctPolicWhite", "indianPerCap",
"PctWorkMom")
variables2<-c("racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap",
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn",
"MedRentPctHousInc", "PolicAveOTWorked", "PersPerFam", "PolicPerPop",
"racepctblack", "PctBornSameState", "PctPolicWhite", "indianPerCap",
"PctWorkMom")
data4<-data2[,c(variables2,vardep)]
listclass<-c("")
grupos<-4
sinicio<-1234
repe<-50
set.seed(1234)
archivo_m$murder<-ifelse(archivo_m$murder==1,"Yes","No")
archivo_m <- na.omit(archivo_m)
archivo_m <-as_tibble(archivo_m)
archivo_m <- as.data.frame(archivo_m)
dput(names(archivo_m))
c("murder", "perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap",
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg",
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent",
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85",
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite",
"PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked",
"LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr",
"LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv",
"PersPerFam", "PctWorkMom")
medias1bis<-cruzadaSVMbinRBF(data=data3,
vardep=vardep,listconti=listconti,
listclass=listclass,grupos=grupos,
sinicio=sinicio,repe=repe,
C=5,sigma=1.5)
C sigma Accuracy Kappa AccuracySD KappaSD
1 5 1.5 0.8773853 0.7547708 0.01224981 0.02449797
predi1<-as.data.frame(medias1bis[2])
predi1$svmRadial<-predi1$Yes
medias2bis<-cruzadaavnnetbin(data=data3,
vardep=vardep,listconti=listconti,
listclass=listclass,grupos=grupos,sinicio=sinicio,repe=repe,repeticiones=5,itera=101,
size=c(30),decay=c(0.01))
size decay bag Accuracy Kappa AccuracySD KappaSD
1 30 0.01 FALSE 0.8614919 0.7229834 0.01396391 0.0279286
predi2<-as.data.frame(medias2bis[2])
predi2$network<-predi2$Yes
medias3bis<-cruzadagbmbin(data=data3,
vardep=vardep,listconti=listconti,
listclass=listclass,grupos=grupos,sinicio=sinicio,repe=repe,
n.minobsinnode=10,shrinkage=0.1,n.trees=300,
interaction.depth=8)
n.minobsinnode shrinkage n.trees interaction.depth Accuracy
1 10 0.1 300 8 0.8781907
Kappa AccuracySD KappaSD
1 0.7563819 0.01162715 0.02325265
predi3<-as.data.frame(medias3bis[2])
predi3$gbm<-predi3$Yes
medias4bis<-cruzadarfbin(data=data3,
vardep=vardep,listconti= listconti,
listclass=listclass,grupos=grupos,sinicio=sinicio,repe=repe,
nodesize=20,mtry=length(AIC_list),ntree=500,replace=TRUE,sampsize=600)
mtry Accuracy Kappa AccuracySD KappaSD
1 17 0.8212026 0.6424048 0.01320181 0.02640279
predi4<-as.data.frame(medias4bis[2])
predi4$bagging <-predi4$Yes
medias5bis<- cruzadarfbin(data=data3,
vardep=vardep,listconti=listconti,
listclass=listclass,grupos=grupos,
sinicio=sinicio,repe=repe,
nodesize=20,mtry=12,ntree=500,replace=TRUE,sampsize=600)
mtry Accuracy Kappa AccuracySD KappaSD
1 12 0.8209 0.6417999 0.01317354 0.02634695
predi5<-as.data.frame(medias5bis[2])
predi5$RF<-predi5$Yes
set.seed(123)
unipredi<-cbind(predi1, predi2, predi3, predi4, predi5)
#we eliminate duplicates
unipredi<- unipredi[, !duplicated(colnames(unipredi))]
#Combinations of 2
unipredi$predi6<-(unipredi$bagging+unipredi$network)/2
unipredi$predi7<-(unipredi$bagging+unipredi$gbm)/2
unipredi$predi8<-(unipredi$bagging+unipredi$RF)/2
unipredi$predi9<-(unipredi$bagging+unipredi$svmRadial)/2
unipredi$predi10<-(unipredi$svmRadial+unipredi$network)/2
unipredi$predi11<-(unipredi$svmRadial+unipredi$gbm)/2
unipredi$predi12<-(unipredi$network+unipredi$gbm)/2
unipredi$predi13<-(unipredi$network+unipredi$RF)/2
unipredi$predi14<-(unipredi$gbm+unipredi$RF)/2
unipredi$predi15<-(unipredi$svmRadial+unipredi$RF)/2
#Combinations of 3
unipredi$predi16<-(unipredi$gbm+unipredi$network+unipredi$svmRadial)/3
unipredi$predi17<-(unipredi$gbm+unipredi$bagging+unipredi$svmRadial)/3
unipredi$predi18<-(unipredi$gbm+unipredi$network+unipredi$bagging)/3
unipredi$predi19<-(unipredi$svmRadial+unipredi$bagging+unipredi$network)/3
unipredi$predi20<-(unipredi$bagging+unipredi$network+unipredi$RF)/3
unipredi$predi21<-(unipredi$svmRadial+unipredi$bagging+unipredi$RF)/3
unipredi$predi22<-(unipredi$RF+unipredi$bagging+unipredi$gbm)/3
unipredi$predi23<-(unipredi$svmRadial+unipredi$RF+unipredi$network)/3
unipredi$predi24<-(unipredi$svmRadial+unipredi$RF+unipredi$gbm)/3
unipredi$predi25<-(unipredi$gbm+unipredi$RF+unipredi$network)/3
#Combinations of 4
unipredi$predi26<-(unipredi$bagging+unipredi$network+unipredi$RF+unipredi$svmRadial)/4
unipredi$predi27<-(unipredi$bagging+unipredi$RF+unipredi$svmRadial+unipredi$gbm)/4
unipredi$predi28<-(unipredi$network+unipredi$RF+unipredi$svmRadial+unipredi$gbm)/4
unipredi$predi29<-(unipredi$bagging+unipredi$network+unipredi$svmRadial+unipredi$gbm)/4
unipredi$predi30<-(unipredi$bagging+unipredi$network+unipredi$RF+unipredi$gbm)/4
#All 5
unipredi$predi31<-(unipredi$gbm+unipredi$network+unipredi$svmRadial+unipredi$RF+unipredi$bagging)/5
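The ensembling above is plain probability averaging followed by a 0.5 cut-off. A minimal sketch (in Python for illustration, with hypothetical probabilities — the model names only echo ours):

```python
def ensemble_average(prob_lists):
    """Average the P(Yes) of several models observation-wise."""
    n_models = len(prob_lists)
    return [sum(ps) / n_models for ps in zip(*prob_lists)]

# Hypothetical per-observation P(Yes) from two models
svm_radial = [0.92, 0.40, 0.55]
rf = [0.88, 0.30, 0.48]

avg = ensemble_average([svm_radial, rf])
labels = ["Yes" if p > 0.5 else "No" for p in avg]  # same 0.5 cut-off as in the document
print(labels)  # -> ['Yes', 'No', 'Yes']
```

Averaging lets a confident model pull a borderline prediction over the threshold, which is exactly why the predi6–predi31 combinations can outperform their members.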
c("pred", "obs", "No", "Yes", "rowIndex", "C", "sigma", "Resample",
"Fold", "Rep", "svmRadial", "size", "decay", "bag", "network",
"n.minobsinnode", "shrinkage", "n.trees", "interaction.depth",
"gbm", "tasa", "auc", "mtry", "bagging", "RF", "predi6", "predi7",
"predi8", "predi9", "predi10", "predi11", "predi12", "predi13",
"predi14", "predi15", "predi16", "predi17", "predi18", "predi19",
"predi20", "predi21", "predi22", "predi23", "predi24", "predi25",
"predi26", "predi27", "predi28", "predi29", "predi30", "predi31"
)
listado<-c("svmRadial", "network", "RF", "bagging","gbm", "predi6", "predi7",
"predi8", "predi9", "predi10", "predi11", "predi12", "predi13", "predi14", "predi15", "predi16", "predi17", "predi18", "predi19", "predi20"
, "predi21", "predi22", "predi23", "predi24","predi25", "predi26", "predi28", "predi29", "predi30","predi31")
tasafallos<-function(x,y) {
confu<-confusionMatrix(x,y)
tasa<-confu[[3]][1]
return(tasa)
}
auc<-function(x,y) {
curvaroc<-roc(response=x,predictor=y)
auc<-curvaroc$auc
return(auc)
}
repeticiones<-nlevels(factor(unipredi$Rep))
unipredi$Rep<-as.factor(unipredi$Rep)
unipredi$Rep<-as.numeric(unipredi$Rep)
medias0<-data.frame(c())
for (prediccion in listado)
{
unipredi$proba<-unipredi[,prediccion]
unipredi[,prediccion]<-ifelse(unipredi[,prediccion]>0.5,"Yes","No")
for (repe in 1:repeticiones)
{
paso <- unipredi[(unipredi$Rep==repe),]
pre<-factor(paso[,prediccion])
archi<-paso[,c("proba","obs")]
archi<-archi[order(archi$proba),]
obs<-paso[,c("obs")]
tasa=1-tasafallos(pre,obs)
t<-as.data.frame(tasa)
t$modelo<-prediccion
auc<-suppressMessages(auc(archi$obs,archi$proba))
t$auc<-auc
medias0<-rbind(medias0,t)
}
}
We display the sorted boxplots to see the Error Rate and AUC.
set.seed(1233)
medias0$modelo <- with(medias0,
reorder(modelo,tasa, mean))
par(cex.axis=0.7,las=2)
boxplot(data=medias0,tasa~modelo,col="red", main='Error Rate')
tablamedias2<-medias0 %>%
group_by(modelo) %>%
summarize(auc=mean(auc))
tablamedias2<-tablamedias2[order(-tablamedias2$auc),]
medias0$modelo <- with(medias0,
reorder(modelo,auc, mean))
par(cex.axis=0.7,las=2)
boxplot(data=medias0,auc~modelo,col="blue", main='AUC')
We can see that the top 5 models by AUC also appear in the top 5 by Error Rate. In both cases the top 5 consists entirely of ensembles, showing that combining our models is advantageous.
Top 5 AUC: predi11, predi16, predi15, predi9 and predi10; among these, predi15 has the lowest Error Rate.
Top 5 Error Rate: predi15, predi9, predi11, predi10 and predi16; among these, predi11 has the highest AUC.
Predi11 and predi15 therefore seem to be the best overall models.
SVM Radial is present in every one of the top 5 ensembles, for both AUC and Error Rate.
Top 5 AUC:
| Rank | Ensemble | Models included |
|---|---|---|
| 1 | Predi11 | svmRadial, gbm |
| 2 | Predi16 | svmRadial, gbm, network |
| 3 | Predi15 | svmRadial, RF |
| 4 | Predi9 | svmRadial, bagging |
| 5 | Predi10 | svmRadial, network |
Top 5 Error Rate:
| Rank | Ensemble | Models included |
|---|---|---|
| 1 | Predi15 | svmRadial, RF |
| 2 | Predi9 | svmRadial, bagging |
| 3 | Predi11 | svmRadial, gbm |
| 4 | Predi10 | svmRadial, network |
| 5 | Predi16 | svmRadial, network, gbm |
We will perform a hypothesis test between predi11 and predi15 to decide which one to choose.
set.seed(232)
listamodelos<-c("predi11","predi15")
datacontraste<-medias0[which(medias0$modelo%in%listamodelos),]
# Error Rates
res <- t.test(datacontraste$tasa ~datacontraste$modelo)
res
Welch Two Sample t-test
data: datacontraste$tasa by datacontraste$modelo
t = -3.3559, df = 95.564, p-value = 0.001136
alternative hypothesis: true difference in means between group predi15 and group predi11 is not equal to 0
95 percent confidence interval:
-0.004575374 -0.001174282
sample estimates:
mean in group predi15 mean in group predi11
0.09471802 0.09759285
The p-value is below 0.05, so we can conclude that the difference between the two Error Rate means is significantly different from zero: predi15 is significantly better than predi11 in Error Rate.
Welch Two Sample t-test
data: datacontraste$auc by datacontraste$modelo
t = -2.6644, df = 96.871, p-value = 0.009034
alternative hypothesis: true difference in means between group predi15 and group predi11 is not equal to 0
95 percent confidence interval:
-0.0028801632 -0.0004210653
sample estimates:
mean in group predi15 mean in group predi11
0.9513732 0.9530239
The same holds here: the difference between the two AUC means is significantly different from zero. Therefore, as expected, predi11 is significantly better than predi15 in AUC.
However, the p-value for the Error Rate test is further below 0.05, indicating that the difference in Error Rate is more significant than the difference in AUC. Therefore predi15 will be selected as the best model.
| Measure | Predi11 | Predi15 | p-value |
|---|---|---|---|
| AUC | 0.9530 | 0.9514 | 0.0090 |
| Error Rate | 0.0976 | 0.0947 | 0.0011 |
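For reference, the Welch statistic used above can be computed by hand; it drops the pooled-variance assumption of the classic t-test. This sketch (in Python for illustration) uses hypothetical per-repetition error rates, not our actual resampling output:

```python
import math

def welch_t(a, b):
    """Welch's two-sample t statistic and degrees of freedom (unequal variances)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((v - ma) ** 2 for v in a) / (na - 1)  # sample variance of each group
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical per-repetition error rates (illustrative numbers only)
errs_a = [0.094, 0.095, 0.093, 0.096, 0.095]
errs_b = [0.098, 0.097, 0.099, 0.096, 0.098]
t, df = welch_t(errs_a, errs_b)
print(round(t, 2), round(df, 1))  # a strongly negative t favours the first group
```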
The first step of the conclusion is to construct the confusion matrix based on our ensemble model predi15. The confusion matrix gives an overview of the agreement between the predictions of our model and the actual values. We will use repeated CV with 6 replicates.
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 7978 858
Yes 746 7866
Accuracy : 0.9081
95% CI : (0.9037, 0.9123)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8161
Mcnemar's Test P-Value : 0.005579
Sensitivity : 0.9017
Specificity : 0.9145
Pos Pred Value : 0.9134
Neg Pred Value : 0.9029
Prevalence : 0.5000
Detection Rate : 0.4508
Detection Prevalence : 0.4936
Balanced Accuracy : 0.9081
'Positive' Class : Yes
| Prediction/Reference | No | Yes |
|---|---|---|
| No | 7978 | 858 |
| Yes | 746 | 7866 |
| Measure | Score |
|---|---|
| Accuracy | 0.9081 |
| Kappa | 0.8161 |
| Sensitivity | 0.9017 |
| Specificity | 0.9145 |
The metrics prove to be exceptionally good. The True Positive Rate (Sensitivity) is 90% and the True Negative Rate (Specificity) is 91%. Accuracy is also 90%, which means that 9 out of 10 data points are classified correctly. Kappa is likewise high, considering that the threshold for an acceptable value is 0.6. This strong predictive capability can be partly attributed to the oversampling of the minority class during the exploratory phase. The two classes were perfectly balanced from then on, so there is no significant difference between Sensitivity and Specificity, since the model was not trained more on one of the two classes. Hence, no adjustment of the threshold is necessary.
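These figures can be re-derived directly from the confusion-matrix counts; a quick check (in Python for illustration, with 'Yes' as the positive class):

```python
# Counts from the confusion matrix above (rows = prediction, columns = reference)
tn, fn = 7978, 858    # predicted No:  reference No / reference Yes
fp, tp = 746, 7866    # predicted Yes: reference No / reference Yes
total = tn + fn + fp + tp

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate

# Cohen's kappa: observed agreement versus agreement expected by chance
p_yes_pred, p_yes_obs = (tp + fp) / total, (tp + fn) / total
p_expected = p_yes_pred * p_yes_obs + (1 - p_yes_pred) * (1 - p_yes_obs)
kappa = (accuracy - p_expected) / (1 - p_expected)

print(round(accuracy, 4), round(kappa, 4), round(sensitivity, 4), round(specificity, 4))
# -> 0.9081 0.8161 0.9017 0.9145, matching the caret output
```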
set.seed(1234)
data3$murder <- as.factor(data3$murder)
data3$murder <- relevel(data3$murder,ref="No")
var <- c("racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap",
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn",
"MedRentPctHousInc", "PolicAveOTWorked", "PersPerFam", "PolicPerPop",
"racepctblack", "PctBornSameState", "PctPolicWhite", "indianPerCap",
"PctWorkMom")
vardep <- "murder"
data4<-data3[,c(var,"murder")]
logistica<-glm(factor(murder)~.,data=data4,family="binomial")
summary(logistica)
Call:
glm(formula = factor(murder) ~ ., family = "binomial", data = data4)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.2170 -0.6984 -0.0274 0.6847 2.5775
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.38252 0.07386 5.179 2.23e-07 ***
racePctWhite -1.10047 0.17659 -6.232 4.61e-10 ***
TotalPctDiv 0.57164 0.06273 9.113 < 2e-16 ***
pctWWage -0.30159 0.06704 -4.498 6.85e-06 ***
blackPerCap -0.17502 0.07262 -2.410 0.015946 *
PolicOperBudg 0.93382 0.31275 2.986 0.002828 **
PctPolicMinor 0.76118 0.21739 3.501 0.000463 ***
LemasPctOfficDrugUn 0.25866 0.06891 3.754 0.000174 ***
PctForeignBorn -0.24932 0.09881 -2.523 0.011630 *
MedRentPctHousInc 0.15239 0.05775 2.639 0.008318 **
PolicAveOTWorked -0.16298 0.06112 -2.667 0.007658 **
PersPerFam 0.21978 0.08602 2.555 0.010616 *
PolicPerPop -0.17377 0.07021 -2.475 0.013331 *
racepctblack 0.32598 0.16874 1.932 0.053371 .
PctBornSameState -0.09319 0.05908 -1.577 0.114703
PctPolicWhite 0.26132 0.16028 1.630 0.103021
indianPerCap -0.10508 0.07364 -1.427 0.153609
PctWorkMom -0.08553 0.06030 -1.418 0.156063
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4031.3 on 2907 degrees of freedom
Residual deviance: 2618.4 on 2890 degrees of freedom
AIC: 2654.4
Number of Fisher Scoring iterations: 7
| Variable | Coefficient |
|---|---|
| racePctWhite | -1.10047 |
| TotalPctDiv | 0.57164 |
| pctWWage | -0.30159 |
| blackPerCap | -0.17502 |
| PolicOperBudg | 0.93382 |
| PctPolicMinor | 0.76118 |
| LemasPctOfficDrugUn | 0.25866 |
| PctForeignBorn | -0.24932 |
| MedRentPctHousInc | 0.15239 |
| PolicAveOTWorked | -0.16298 |
| PersPerFam | 0.21978 |
| PolicPerPop | -0.17377 |
| racepctblack | 0.32598 |
| PctBornSameState | -0.09319 |
| PctPolicWhite | 0.26132 |
| indianPerCap | -0.10508 |
| PctWorkMom | -0.08553 |
The table of coefficients shows that the PolicOperBudg variable has the highest positive coefficient (0.93). This means that, according to the model, a high police operating budget is associated with a higher probability of the community having an above-average murder rate. Similarly, communities with a high PctPolicMinor also have a higher probability of a high murder rate.
On the contrary, the coefficient of racePctWhite (the largest in absolute terms) has a negative impact on the prediction. That is, a high percentage of Caucasians is associated with a lower probability of having an above-national-average murder rate.
In total we have only one variable with an absolute coefficient above 1.0: racePctWhite.
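Logistic coefficients are often easier to read as odds ratios, i.e. exp(coefficient); a quick check (in Python for illustration) on the two largest coefficients above:

```python
import math

# Coefficients taken from the logistic regression output above
coefs = {"racePctWhite": -1.10047, "PolicOperBudg": 0.93382}

for name, beta in coefs.items():
    print(f"{name}: odds ratio {math.exp(beta):.2f}")
# racePctWhite:  OR ~ 0.33 -- a one-unit increase cuts the odds of a
#                high murder rate to about a third
# PolicOperBudg: OR ~ 2.54 -- a one-unit increase multiplies those odds
#                by roughly 2.5
```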
In order to also offer an easily interpretable analysis to decision-makers, a decision tree will be built. It will also serve to see which variables are regarded as important and how they overlap with the ones above. When creating the decision tree, its parameters will be chosen so that at least 10 observations fall in every leaf (minbucket=10), which prevents small splits at node level, an overly deep tree and overfitting. Furthermore, to guarantee the most effective splits by decreasing the impurity of the corresponding child nodes, the Gini impurity criterion will be used.
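The Gini criterion measures node impurity as 1 − Σ pₖ², and the tree picks the split that minimizes the size-weighted impurity of the two children. A small illustration (in Python, with made-up class counts):

```python
def gini(counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_split_impurity(left, right):
    """Quality of a candidate split: size-weighted impurity of the children."""
    n = sum(left) + sum(right)
    return sum(left) / n * gini(left) + sum(right) / n * gini(right)

print(gini([50, 50]))            # 0.5 -- maximally mixed binary node
print(round(gini([95, 5]), 3))   # 0.095 -- nearly pure node
# A split that separates the classes well scores lower than a useless one:
print(weighted_split_impurity([45, 5], [5, 45]) < weighted_split_impurity([25, 25], [25, 25]))  # True
```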
# Simple tree
library(rpart)
library(rpart.plot)
set.seed(123)
arbol1 <- rpart(murder ~ ., data = data3,
minbucket =10,method = "class",parms=list(split="gini"),cp=0)
rpart.plot(arbol1,extra=105,nn=TRUE, tweak=1.2)
The variables selected by the decision tree are racePctWhite, blackPerCap, TotalPctDiv, MedRentPctHousInc, pctWWage, indianPerCap, PctWorkMom, PctForeignBorn, PctBornSameState, PersPerFam and PctPolicMinor. All of them also appear in the AIC selection. Interestingly, only one police-related variable was chosen by the decision tree, as opposed to six by AIC; this can be considered an advantage, recalling the high number of NAs present in the variables of that group.
The best models created through this analysis are:
| Model | AUC | Error Rate |
|---|---|---|
| SVM RBF | 0.9383 | 0.1226 |
| GBM | 0.9297 | 0.1218 |
| Neural Network | 0.9208 | 0.1385 |
| Predi11 | 0.9530 | 0.0976 |
| Predi15 | 0.9514 | 0.0947 |
Looking at this final table, which compares the AUC and Error Rates of the best ensemble models and the simpler models developed throughout this work, we can see that the best model for AUC is predi11, a combination of the svmRadial and GBM models. However, predi15 has the lowest Error Rate. After the hypothesis testing, we conclude that it is better to stay with predi15, since the difference between the two models' Error Rate means is larger and more significant than the difference in their AUC means.